1. Data Load
1.1 Merge Data
2. EDA (Exploratory Data Analysis)
2.1 Structure of the Data
2.2 Descriptive Stats
2.3 Data Imbalance Check
2.4 Check for missing values
2.5 Feature Analysis
2.5.1 Categorical Feature Analysis
2.5.2 Numerical Features Analysis
2.6 Correlation Analysis
3. Data Standardization
3.1 Dropping useless features based on EDA and Correlation
3.2 Missing Value Imputation
3.3 Train-Test Split
3.4 Feature Encoding
4. Feature Selection
4.1 For Numerical Data
4.2 For Categorical Data
5. Build Data Matrix for Models
6. TSNE Visualization
7. Models
7.1 Machine Learning Models
7.1.1 Random Model
7.1.2 Logistic Regression (SGD) with Hyperparameter Tuning
7.1.3 Logistic Regression (Sklearn)
7.1.4 KNN
7.1.5 Decision Tree Model
7.1.6 Random Forest Model
7.1.7 XGBoost Model
7.2 Deep Learning Model
7.2.1 Neural Network (ANN) Model
8. Comparison
9. Conclusion
1. To build a high-accuracy model for binary classification.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from datetime import date
from datetime import datetime
from scipy import stats
# from scipy.stats import boxcox
from scipy.special import boxcox, inv_boxcox
import math
import start
Initial setup completed.
from helper import *
Helper Imported.
valid_data = pd.read_json (r'validbets.json')
invalid_data = pd.read_json (r'invalidbets.json')
valid_data.head(3)
| _id | stake | type | placedDate | horse | betRate | marketId | IP | eventType | userName | selectionName | marketName | event | averagePriceMatched | status | winnerId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60337ca8cc80710087ff5d8c | 500.000 | BACK | 2021-02-22T09:43:04.686Z | 8776882 | 1.020 | 1.180 | 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8 | Tennis | brovinn | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.020 | WINNER_DECLARED | 8776882 |
| 1 | 60337d5dcc80710087ff5d90 | 2100.000 | LAY | 2021-02-22T09:46:05.458Z | 8776882 | 1.010 | 1.180 | 49.36.123.125 | Tennis | aksash111 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.010 | WINNER_DECLARED | 8776882 |
| 2 | 60337b0a13046300869b42c4 | 25000.000 | LAY | 2021-02-22T09:36:10.372Z | 8776882 | 1.060 | 1.180 | 2405:201:25:d0aa:11b4:2e1c:9999:f32a | Tennis | pinka2 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.050 | WINNER_DECLARED | 8776882 |
print("Valid Data Shape : ",valid_data.shape)
print("No. of Data points : ", valid_data.shape[0])
print("No. of Features : ", valid_data.shape[1]-1)
Valid Data Shape : (10000, 16)
No. of Data points : 10000
No. of Features : 15
invalid_data.head(3)
| _id | stake | type | placedDate | horse | betRate | marketId | IP | eventType | userName | selectionName | marketName | event | averagePriceMatched | status | winnerId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5fa47215305f1f00f02b5ae4 | 783758 | LAY | 2020-11-05T21:43:49.222Z | 127991 | 160.000 | 1.175 | 185.105.2.158 | Soccer | contra11 | AC Milan | Match Odds | AC Milan v Lille | 160.000 | INVALID_BET | 44790 |
| 1 | 5fa014d9d1a79f00d41349df | 4500 | BACK | 2020-11-02T14:16:57.494Z | 22121561 | 1.960 | 1.175 | 106.204.14.80 | Cricket | sirsa3 | Delhi Capitals | Match Odds | Delhi Capitals v Royal Challengers Bangalore | 1.970 | INVALID_BET | 22121561 |
| 2 | 5fa014db5a7a0a00e2868190 | 7000 | BACK | 2020-11-02T14:16:59.763Z | 22121561 | 1.960 | 1.175 | 185.203.122.18 | Cricket | bhush001 | Delhi Capitals | Match Odds | Delhi Capitals v Royal Challengers Bangalore | 1.960 | INVALID_BET | 22121561 |
print("Invalid Data Shape : ",invalid_data.shape)
print("No. of Data points : ", invalid_data.shape[0])
print("No. of Features : ", invalid_data.shape[1]-1)
Invalid Data Shape : (66, 16)
No. of Data points : 66
No. of Features : 15
merged_data = pd.concat([valid_data,invalid_data])
merged_data.head(3)
| _id | stake | type | placedDate | horse | betRate | marketId | IP | eventType | userName | selectionName | marketName | event | averagePriceMatched | status | winnerId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60337ca8cc80710087ff5d8c | 500.000 | BACK | 2021-02-22T09:43:04.686Z | 8776882 | 1.020 | 1.180 | 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8 | Tennis | brovinn | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.020 | WINNER_DECLARED | 8776882 |
| 1 | 60337d5dcc80710087ff5d90 | 2100.000 | LAY | 2021-02-22T09:46:05.458Z | 8776882 | 1.010 | 1.180 | 49.36.123.125 | Tennis | aksash111 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.010 | WINNER_DECLARED | 8776882 |
| 2 | 60337b0a13046300869b42c4 | 25000.000 | LAY | 2021-02-22T09:36:10.372Z | 8776882 | 1.060 | 1.180 | 2405:201:25:d0aa:11b4:2e1c:9999:f32a | Tennis | pinka2 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.050 | WINNER_DECLARED | 8776882 |
print("Data Shape : ",merged_data.shape)
print("No. of Data points : ", merged_data.shape[0])
print("No. of Features : ", merged_data.shape[1]-1)
Data Shape : (10066, 16)
No. of Data points : 10066
No. of Features : 15
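The `info()` output in section 2.1 reports `Int64Index: 10066 entries, 0 to 65`, which suggests that the concatenated frame keeps each source frame's own row labels, so the labels 0 to 65 occur twice. The following is a minimal sketch on toy frames (`valid_toy` and `invalid_toy` are illustrative names, not part of the notebook) of how `ignore_index=True` gives a unique index instead:

```python
import pandas as pd

# Toy stand-ins for valid_data / invalid_data (illustrative values only).
valid_toy = pd.DataFrame({"stake": [500.0, 2100.0, 25000.0], "status": "WINNER_DECLARED"})
invalid_toy = pd.DataFrame({"stake": [4500.0, 7000.0], "status": "INVALID_BET"})

# Default concat keeps each frame's own labels, so 0 and 1 appear twice.
kept = pd.concat([valid_toy, invalid_toy])

# ignore_index=True relabels the rows 0..n-1.
fresh = pd.concat([valid_toy, invalid_toy], ignore_index=True)

print(kept.index.is_unique)   # duplicate labels present
print(fresh.index.is_unique)  # labels are unique
```

A unique index matters mostly for label-based operations such as `.loc` lookups or index-aligned `pd.concat(..., axis=1)`.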
heading('Backup copy of Data')
original_data = merged_data.copy()
print("Data Shape : ",original_data.shape)
print("No. of Data points : ", original_data.shape[0])
print("No. of Features : ", original_data.shape[1]-1)
------------------- Backup copy of Data -------------------
Data Shape : (10066, 16)
No. of Data points : 10066
No. of Features : 15
merged_data.to_csv('merge_data.csv', index=False)
merged_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10066 entries, 0 to 65
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   _id                  10066 non-null  object
 1   stake                10066 non-null  float64
 2   type                 10066 non-null  object
 3   placedDate           10066 non-null  object
 4   horse                10066 non-null  int64
 5   betRate              10066 non-null  float64
 6   marketId             10066 non-null  float64
 7   IP                   10063 non-null  object
 8   eventType            10066 non-null  object
 9   userName             10066 non-null  object
 10  selectionName        10066 non-null  object
 11  marketName           10066 non-null  object
 12  event                10066 non-null  object
 13  averagePriceMatched  10066 non-null  float64
 14  status               10066 non-null  object
 15  winnerId             10066 non-null  int64
dtypes: float64(4), int64(2), object(10)
memory usage: 1.3+ MB
describe(merged_data).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| stake | 10066.000 | 31709.479 | 145600.465 | 9.000 | 400.000 | 2000.000 | 20000.000 | 9300000.000 | 44194.691 | 31.301 | 1723.736 |
| horse | 10066.000 | 4990883.972 | 4913718.664 | 235.000 | 1222344.000 | 4294272.000 | 9628997.250 | 36846168.000 | 3945296.031 | 1.136 | 1.190 |
| betRate | 10066.000 | 2.452 | 13.855 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.951 | 44.571 | 2835.808 |
| marketId | 10066.000 | 1.179 | 0.001 | 1.175 | 1.179 | 1.179 | 1.180 | 1.180 | 0.001 | -1.300 | 0.975 |
| averagePriceMatched | 10066.000 | 2.466 | 13.969 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.977 | 43.882 | 2754.245 |
| winnerId | 10066.000 | 4775882.844 | 4712822.120 | 235.000 | 1221386.000 | 4294272.000 | 9630879.000 | 36846168.000 | 3856511.193 | 1.029 | 0.602 |
stake has a high standard deviation (about 145,600 against a mean of about 31,709), so its values are widely spread around the mean.
stake, betRate, and averagePriceMatched have high positive skew, which indicates a long right tail in their distributions.
stake, betRate, and averagePriceMatched also have high positive kurtosis, which indicates heavy tails, i.e. more extreme deviations (outliers).
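The `describe` helper used above is not shown in this notebook. As a rough illustration of how the extra columns (`mad`, `skew`, `kurt`) could be computed with plain pandas, here is a sketch on a made-up right-skewed sample. The formula used for `mad` (mean absolute deviation) is an assumption about what the helper reports:

```python
import pandas as pd

# Toy right-skewed sample: mostly small stakes plus one very large one.
stakes = pd.Series([100, 200, 250, 400, 500, 800, 1000, 2000, 5000, 1_000_000], dtype=float)

summary = {
    "mean": stakes.mean(),
    "median": stakes.median(),
    "skew": stakes.skew(),   # > 0 means a long right tail
    "kurt": stakes.kurt(),   # excess kurtosis; > 0 means heavier tails than a normal
    "mad": (stakes - stakes.mean()).abs().mean(),  # mean absolute deviation (assumed definition)
}

for name, value in summary.items():
    print(f"{name:>6}: {value:,.3f}")
```

In a right-skewed sample like this one, the mean sits well above the median, which matches the pattern seen for stake in the table above (mean 31,709 vs. median 2,000).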
merged_data['status'].value_counts()
WINNER_DECLARED    10000
INVALID_BET           66
Name: status, dtype: int64
def pie_labeling(x):
    # x is the wedge percentage passed in by matplotlib's autopct
    print(x)
    return '{:.4f}%\n(#{:.0f})'.format(x, sums.values.sum()*x/100)
from matplotlib.pyplot import pie, axis, show
sums = merged_data['status'].value_counts()
axis('equal')
pie(sums.values, labels=sums.index, autopct=pie_labeling, pctdistance=1.3, labeldistance=1.6)
plt.title("All Data - Dependent Variable ('status') distribution")
plt.show()
99.3443250656128 0.6556725595146418
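The percentages printed above can be reproduced directly from the class counts. The sketch below uses the counts reported by `value_counts()` and also derives the majority-to-minority ratio:

```python
import pandas as pd

# Class counts as reported by value_counts() above.
counts = pd.Series({"WINNER_DECLARED": 10000, "INVALID_BET": 66})

# Share of each class in percent, and the majority:minority ratio.
share_pct = (counts / counts.sum() * 100).round(4)
imbalance_ratio = counts.max() / counts.min()

print(share_pct)
print(f"majority : minority = {imbalance_ratio:.1f} : 1")
```

With this split, a model that always predicts WINNER_DECLARED would already score about 99.34% accuracy, so accuracy on its own says little about how well INVALID_BET is detected.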
merged_data.isnull().values.any()
True
missing_values_table(merged_data)
Dataframe has 16 columns. There are 1 columns that have missing values.
| Missing Values | % of Total Values | |
|---|---|---|
| IP | 3 | 0.000 |
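The percentage shown for IP is 0.000 although 3 of 10066 rows are missing, which is consistent with the helper rounding to one decimal place. `missing_values_table` comes from the helper module and is not shown here; the function below (`missing_summary`) is a hypothetical stand-in that keeps more decimals:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame, decimals: int = 3) -> pd.DataFrame:
    """Hypothetical stand-in for the helper's missing_values_table."""
    n_missing = df.isnull().sum()
    pct = (n_missing / len(df) * 100).round(decimals)
    table = pd.DataFrame({"Missing Values": n_missing, "% of Total Values": pct})
    # Keep only columns with at least one missing value, largest first.
    return table[table["Missing Values"] > 0].sort_values("Missing Values", ascending=False)

# Toy frame: 3 missing IPs out of 10066 rows, as in the table above.
toy = pd.DataFrame({"IP": ["1.2.3.4"] * 10063 + [np.nan] * 3, "stake": 1.0})
result = missing_summary(toy)
print(result)
```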
heading("Number of unique values in each feature")
print(merged_data.nunique())
--------------------------------------- Number of unique values in each feature ---------------------------------------
_id                    10066
stake                    376
type                       2
placedDate             10063
horse                    319
betRate                  258
marketId                 390
IP                      1070
eventType                  3
userName                 410
selectionName            318
marketName                10
event                    245
averagePriceMatched      726
status                     2
winnerId                 195
dtype: int64
categorical_list = ['type', 'horse', 'marketId', 'IP', 'eventType', 'userName', 'selectionName', 'marketName', 'event','winnerId']
numerical_list = ['stake', 'betRate', 'averagePriceMatched']
datetime_list = ['placedDate']
merged_data['_id'].nunique()
10066
heading("Categorical Features")
print('\n'.join(categorical_list))
-------------------- Categorical Features --------------------
type
horse
marketId
IP
eventType
userName
selectionName
marketName
event
winnerId
merged_data['type'].value_counts()
LAY     5303
BACK    4763
Name: type, dtype: int64
crosstab_by_y_plot(merged_data, 'type', figsize=(6,5))
---------------------------- type grouped by status Count ----------------------------
| status | INVALID_BET | WINNER_DECLARED |
|---|---|---|
| type | ||
| BACK | 35 | 4728 |
| LAY | 31 | 5272 |
print('No. of unique values of horse :', merged_data['horse'].nunique())
No. of unique values of horse : 319
count_plot(merged_data, 'horse')
crosstab_by_y(merged_data, 'horse', transposed=True)
----------------------------- horse grouped by status Count -----------------------------
| horse | 235 | 448 | 1096 | 1117 | 1189 | 1703 | 2426 | 2685 | 7407 | 7461 | 7659 | 9162 | 9163 | 10501 | 10761 | 10774 | 10779 | 13360 | 13362 | 14072 | 16606 | 28191 | 28214 | 28220 | 28223 | 37302 | 37303 | 41433 | 44503 | 44504 | 44507 | 44508 | 44518 | 44519 | 44521 | 44526 | 44785 | 44787 | 44790 | 44793 | 44794 | 44795 | 44796 | 44797 | 44798 | 44800 | 46726 | 47972 | 47973 | 47998 | 47999 | 48043 | 48044 | 48224 | 48351 | 48451 | 48461 | 48470 | 48756 | 48759 | 48783 | 48784 | 48785 | 48786 | 48787 | 48793 | 48799 | 49058 | 50347 | 50349 | 51404 | 55190 | 55223 | 55243 | 55264 | 55270 | 55271 | 56036 | 56298 | 56299 | 56301 | 56323 | 56343 | 56363 | 56764 | 56966 | 56967 | 58805 | 58943 | 59044 | 60294 | 60295 | 60297 | 60303 | 60310 | 60443 | 62683 | 62684 | 63347 | 64374 | 64964 | 65352 | 65778 | 66183 | 67143 | 69718 | 69720 | 70385 | 70468 | 77586 | 78864 | 79323 | 79343 | 84649 | 86359 | 113123 | 113125 | 113187 | 113191 | 113239 | 121724 | 127991 | 191604 | 191607 | 198124 | 198136 | 198138 | 199184 | 199545 | 201261 | 201327 | 208035 | 214865 | 215817 | 215821 | 215829 | 247969 | 259394 | 269792 | 298233 | 309111 | 309687 | 309689 | 347774 | 350594 | 361329 | 361706 | 419126 | 476499 | 482032 | 489720 | 495321 | 498560 | 501200 | 505726 | 508773 | 522046 | 522049 | 522054 | 571273 | 674742 | 676464 | 676465 | 676467 | 924268 | 965417 | 968185 | 1029663 | 1088499 | 1205121 | 1205126 | 1221385 | 1221386 | 1222344 | 1222345 | 1222346 | 1222347 | 1254317 | 1485567 | 1485568 | 1485573 | 1557297 | 2009654 | 2013140 | 2047448 | 2080735 | 2081063 | 2249229 | 2250259 | 2250353 | 2255452 | 2257536 | 2263603 | 2263634 | 2312313 | 2312315 | 2469649 | 2487036 | 2506293 | 2542448 | 2542449 | 2810072 | 3158851 | 3186303 | 3237590 | 3258153 | 3630179 | 3691700 | 3809606 | 3954225 | 4294272 | 4294273 | 4297012 | 4638399 | 4729711 | 4822931 | 4855758 | 4859354 | 4864974 | 4943786 | 5045297 | 5071877 | 5168454 | 5304142 | 5340398 | 5626816 | 5774350 | 5851482 | 
5851483 | 5875376 | 6480414 | 6516913 | 6555433 | 6847357 | 7414058 | 7418999 | 7445660 | 7594131 | 7640637 | 7659748 | 7671296 | 7797904 | 7928242 | 8173434 | 8196374 | 8226987 | 8243874 | 8257797 | 8258569 | 8284479 | 8326752 | 8443097 | 8444055 | 8587663 | 8698678 | 8700174 | 8750569 | 8776882 | 8781581 | 8784966 | 8838645 | 8842202 | 8842295 | 8859238 | 8908192 | 9128710 | 9193006 | 9198585 | 9220660 | 9624573 | 9630472 | 9630879 | 9631399 | 9631561 | 9632088 | 9635566 | 10071088 | 10372226 | 10460263 | 10472882 | 10513040 | 10756275 | 10782588 | 10782589 | 10782634 | 10782635 | 10791812 | 10885505 | 10943306 | 11204210 | 11285506 | 11313212 | 11510005 | 11772759 | 11881072 | 12686963 | 12742426 | 12819357 | 13052998 | 13441259 | 13659467 | 13734279 | 13834991 | 16081872 | 16149511 | 16149526 | 17162689 | 17980099 | 19249131 | 19924824 | 19924825 | 19924831 | 19924941 | 21067365 | 22121561 | 24301731 | 25215583 | 36700739 | 36846168 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| status | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| INVALID_BET | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 1 |
| WINNER_DECLARED | 0 | 343 | 1 | 1 | 1 | 2 | 6 | 2 | 1 | 0 | 0 | 10 | 2 | 4 | 9 | 12 | 5 | 6 | 0 | 12 | 404 | 18 | 1 | 8 | 4 | 70 | 189 | 7 | 11 | 5 | 7 | 0 | 1 | 7 | 8 | 1 | 1 | 1 | 7 | 1 | 3 | 3 | 1 | 5 | 5 | 2 | 1 | 148 | 163 | 6 | 15 | 4 | 14 | 9 | 15 | 4 | 6 | 1 | 13 | 1 | 4 | 1 | 0 | 2 | 1 | 1 | 3 | 2 | 4 | 2 | 1 | 17 | 13 | 9 | 9 | 6 | 4 | 3 | 1 | 12 | 10 | 13 | 7 | 2 | 13 | 10 | 17 | 267 | 3 | 5 | 9 | 4 | 8 | 5 | 3 | 0 | 2 | 1 | 8 | 6 | 1 | 1 | 1 | 1 | 10 | 1 | 5 | 1 | 3 | 2 | 4 | 2 | 2 | 1 | 0 | 1 | 16 | 1 | 1 | 5 | 4 | 0 | 2 | 11 | 1 | 5 | 10 | 4 | 13 | 1 | 1 | 6 | 3 | 21 | 1 | 3 | 1 | 2 | 1 | 3 | 2 | 5 | 3 | 4 | 2 | 10 | 4 | 6 | 11 | 1 | 3 | 1 | 3 | 18 | 2 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 12 | 23 | 8 | 1 | 2 | 2 | 3 | 3 | 2 | 23 | 38 | 95 | 35 | 2 | 27 | 717 | 17 | 2 | 0 | 5 | 284 | 8 | 2 | 4 | 3 | 171 | 4 | 16 | 1 | 1 | 6 | 8 | 116 | 187 | 1 | 9 | 4 | 1 | 1 | 273 | 5 | 3 | 3 | 3 | 1 | 1 | 120 | 3 | 573 | 1228 | 19 | 1 | 9 | 8 | 7 | 369 | 1 | 17 | 17 | 77 | 1 | 1 | 8 | 50 | 2 | 13 | 49 | 5 | 3 | 4 | 0 | 0 | 18 | 1 | 1 | 5 | 114 | 0 | 0 | 29 | 16 | 13 | 4 | 0 | 73 | 17 | 7 | 2 | 16 | 15 | 1 | 3 | 5 | 6 | 2 | 22 | 2 | 2 | 5 | 7 | 7 | 9 | 9 | 15 | 7 | 11 | 12 | 5 | 2 | 15 | 11 | 2 | 9 | 3 | 24 | 9 | 8 | 1 | 5 | 1 | 8 | 751 | 531 | 412 | 7 | 6 | 1 | 23 | 36 | 24 | 10 | 22 | 38 | 26 | 64 | 2 | 2 | 1 | 1 | 6 | 62 | 0 | 1 | 56 | 138 | 11 | 9 | 16 | 5 | 98 | 1 | 18 | 0 | 4 | 31 | 0 | 0 |
crosstab_by_y_table(merged_data, 'horse')
----------------------------------------------- Top 10 horse by WINNER_DECLARED and INVALID_BET -----------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| horse | |
| 4294273 | 1228 |
| 10782589 | 751 |
| 1254317 | 717 |
| 4294272 | 573 |
| 10782634 | 531 |
| 10782635 | 412 |
| 16606 | 404 |
| 4859354 | 369 |
| 448 | 343 |
| 2009654 | 284 |
| status | INVALID_BET |
|---|---|
| horse | |
| 235 | 13 |
| 1221386 | 7 |
| 8226987 | 3 |
| 7671296 | 3 |
| 58805 | 3 |
| 60443 | 3 |
| 22121561 | 3 |
| 86359 | 3 |
| 47973 | 2 |
| 7461 | 2 |
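The two top-10 tables above come from the helper `crosstab_by_y_table`, whose code is not shown. A hypothetical sketch of how such per-class rankings could be produced with `pd.crosstab` (the function name `top_by_class` and the toy data are illustrative):

```python
import pandas as pd

def top_by_class(df, feature, target="status", top=10):
    """Hypothetical sketch: top-N feature values per target class."""
    table = pd.crosstab(df[feature], df[target])
    return {cls: table[cls].sort_values(ascending=False).head(top)
            for cls in table.columns}

toy = pd.DataFrame({
    "horse":  [235, 235, 448, 448, 448, 16606],
    "status": ["INVALID_BET", "INVALID_BET", "WINNER_DECLARED",
               "WINNER_DECLARED", "WINNER_DECLARED", "WINNER_DECLARED"],
})
tops = top_by_class(toy, "horse", top=2)
print(tops["INVALID_BET"])
print(tops["WINNER_DECLARED"])
```

Note that with only 66 invalid bets, the lower ranks of the INVALID_BET table rest on counts of 2 or 3, so they should be read with caution.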
print("No. of unique marketId :",merged_data['marketId'].nunique())
No. of unique marketId : 390
count_plot(merged_data, 'marketId')
crosstab_by_y_table(merged_data, 'marketId')
-------------------------------------------------- Top 10 marketId by WINNER_DECLARED and INVALID_BET --------------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| marketId | |
| 1.178 | 889 |
| 1.179 | 747 |
| 1.179 | 587 |
| 1.180 | 557 |
| 1.179 | 538 |
| 1.178 | 538 |
| 1.180 | 470 |
| 1.178 | 420 |
| 1.180 | 416 |
| 1.179 | 398 |
| status | INVALID_BET |
|---|---|
| marketId | |
| 1.179 | 14 |
| 1.179 | 5 |
| 1.176 | 5 |
| 1.179 | 3 |
| 1.175 | 3 |
| 1.179 | 3 |
| 1.175 | 3 |
| 1.176 | 2 |
| 1.179 | 2 |
| 1.176 | 2 |
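The tables above repeat labels such as 1.179 because marketId is stored as float64 (see the `info()` output) and displayed with three decimals, so distinct markets look identical. Since marketId is treated as a categorical feature, one option is to convert it to a string label. The sketch below uses made-up IDs, and the nine-decimal format is an assumption about how many digits the real IDs carry:

```python
import pandas as pd

# Made-up market IDs that differ only beyond the third decimal place.
toy = pd.DataFrame({"marketId": [1.178123456, 1.178654321, 1.179000111]})

# Rounded to 3 decimals, two distinct IDs collapse to the same label.
shown = toy["marketId"].round(3)

# Treating the ID as a label (string) keeps every market distinct.
toy["marketId_label"] = toy["marketId"].map("{:.9f}".format)

print(shown.nunique(), toy["marketId_label"].nunique())
```

For the same reason, summary statistics and box plots of marketId describe the numeric encoding of an identifier rather than a measured quantity.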
describe(merged_data['marketId'])
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| marketId | 10066.000 | 1.179 | 0.001 | 1.175 | 1.179 | 1.179 | 1.180 | 1.180 | 0.001 | -1.300 | 0.975 |
box_violin_plot(merged_data, 'marketId')
histplot(merged_data, 'marketId')
IP has 3 missing values.
merged_data['IP']
0 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8
1 49.36.123.125
2 2405:201:25:d0aa:11b4:2e1c:9999:f32a
3 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8
4 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8
...
61 106.207.179.134
62 103.200.84.188
63 103.212.156.208
64 124.253.0.211
65 2001:8f8:1a63:d2a1:ac87:4f9c:69c9:509e
Name: IP, Length: 10066, dtype: object
print("No. of unique IP :",merged_data['IP'].nunique())
No. of unique IP : 1070
count_plot(merged_data, 'IP')
We add geographical information derived from the IP address.
This information was gathered from an online IP-lookup service, so its accuracy depends on the service provider.
These features can be useful for prediction if any geographical correlation exists.
map_data = pd.read_csv('ip_Details_splitted.csv')
map_data.head(3)
| IP | Details | city | region | country | loc | org | postal | timezone | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 182.64.30.119 | ['Delhi', 'Delhi', 'IN', '28.6519,77.2315', 'AS24560 Bharti Airtel Ltd., Telemedia Services', '1... | Delhi | Delhi | IN | 28.6519,77.2315 | AS24560 Bharti Airtel Ltd., Telemedia Services | 110001 | Asia/Kolkata |
| 1 | 139.5.236.244 | ['Mumbai', 'Maharashtra', 'IN', '19.0728,72.8826', 'AS136334 Vortex Netsol Private Limited', '40... | Mumbai | Maharashtra | IN | 19.0728,72.8826 | AS136334 Vortex Netsol Private Limited | 400070 | Asia/Kolkata |
| 2 | 49.36.123.125 | ['Mumbai', 'Maharashtra', 'IN', '19.0728,72.8826', 'AS55836 Reliance Jio Infocomm Limited', '400... | Mumbai | Maharashtra | IN | 19.0728,72.8826 | AS55836 Reliance Jio Infocomm Limited | 400070 | Asia/Kolkata |
merged_data = pd.merge(merged_data, map_data, left_on= ['IP'],
right_on= ['IP'],
how = 'left')
merged_data.head(3)
| _id | stake | type | placedDate | horse | betRate | marketId | IP | eventType | userName | selectionName | marketName | event | averagePriceMatched | status | winnerId | Details | city | region | country | loc | org | postal | timezone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60337ca8cc80710087ff5d8c | 500.000 | BACK | 2021-02-22T09:43:04.686Z | 8776882 | 1.020 | 1.180 | 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8 | Tennis | brovinn | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.020 | WINNER_DECLARED | 8776882 | ['Delhi', 'Delhi', 'IN', '28.6519,77.2315', 'AS45609 Bharti Airtel Ltd. AS for GPRS Service', '1... | Delhi | Delhi | IN | 28.6519,77.2315 | AS45609 Bharti Airtel Ltd. AS for GPRS Service | 110001 | Asia/Kolkata |
| 1 | 60337d5dcc80710087ff5d90 | 2100.000 | LAY | 2021-02-22T09:46:05.458Z | 8776882 | 1.010 | 1.180 | 49.36.123.125 | Tennis | aksash111 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.010 | WINNER_DECLARED | 8776882 | ['Mumbai', 'Maharashtra', 'IN', '19.0728,72.8826', 'AS55836 Reliance Jio Infocomm Limited', '400... | Mumbai | Maharashtra | IN | 19.0728,72.8826 | AS55836 Reliance Jio Infocomm Limited | 400070 | Asia/Kolkata |
| 2 | 60337b0a13046300869b42c4 | 25000.000 | LAY | 2021-02-22T09:36:10.372Z | 8776882 | 1.060 | 1.180 | 2405:201:25:d0aa:11b4:2e1c:9999:f32a | Tennis | pinka2 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.050 | WINNER_DECLARED | 8776882 | ['Airoli', 'Maharashtra', 'IN', '19.1167,72.9833', 'AS55836 Reliance Jio Infocomm Limited', '400... | Airoli | Maharashtra | IN | 19.1167,72.9833 | AS55836 Reliance Jio Infocomm Limited | 400701 | Asia/Kolkata |
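A left merge silently duplicates rows if the lookup table contains the same IP more than once, and silently leaves gaps for IPs it does not contain. The sketch below, on toy frames (`bets`, `lookup`, and the address `10.0.0.1` are illustrative), shows how the `validate` and `indicator` parameters of `pd.merge` can make both cases visible:

```python
import pandas as pd

bets = pd.DataFrame({"IP": ["49.36.123.125", "182.64.30.119", None, "10.0.0.1"],
                     "stake": [2100.0, 500.0, 100.0, 50.0]})
lookup = pd.DataFrame({"IP": ["49.36.123.125", "182.64.30.119"],
                       "city": ["Mumbai", "Delhi"]})

# validate="many_to_one" raises MergeError if an IP appears twice in the lookup
# table (which would duplicate bets); indicator=True adds a `_merge` column.
checked = pd.merge(bets, lookup, on="IP", how="left",
                   validate="many_to_one", indicator=True)

# Rows with no match in the lookup table (missing or unknown IP).
unmatched = checked[checked["_merge"] == "left_only"]
print(len(checked) == len(bets), len(unmatched))
```

Applied to the real data, the `left_only` rows would correspond to bets whose geographic columns are missing after the merge.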
lat_lon_df = merged_data["loc"].str.split(",", n = 1, expand = True)
lat_lon_df.columns = ['Latitude', 'Longitude']
merged_data = pd.concat([merged_data, lat_lon_df], axis=1)
merged_data.head(3)
| _id | stake | type | placedDate | horse | betRate | marketId | IP | eventType | userName | selectionName | marketName | event | averagePriceMatched | status | winnerId | Details | city | region | country | loc | org | postal | timezone | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60337ca8cc80710087ff5d8c | 500.000 | BACK | 2021-02-22T09:43:04.686Z | 8776882 | 1.020 | 1.180 | 2401:4900:30e5:71f1:a86c:d186:5d52:3fa8 | Tennis | brovinn | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.020 | WINNER_DECLARED | 8776882 | ['Delhi', 'Delhi', 'IN', '28.6519,77.2315', 'AS45609 Bharti Airtel Ltd. AS for GPRS Service', '1... | Delhi | Delhi | IN | 28.6519,77.2315 | AS45609 Bharti Airtel Ltd. AS for GPRS Service | 110001 | Asia/Kolkata | 28.6519 | 77.2315 |
| 1 | 60337d5dcc80710087ff5d90 | 2100.000 | LAY | 2021-02-22T09:46:05.458Z | 8776882 | 1.010 | 1.180 | 49.36.123.125 | Tennis | aksash111 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.010 | WINNER_DECLARED | 8776882 | ['Mumbai', 'Maharashtra', 'IN', '19.0728,72.8826', 'AS55836 Reliance Jio Infocomm Limited', '400... | Mumbai | Maharashtra | IN | 19.0728,72.8826 | AS55836 Reliance Jio Infocomm Limited | 400070 | Asia/Kolkata | 19.0728 | 72.8826 |
| 2 | 60337b0a13046300869b42c4 | 25000.000 | LAY | 2021-02-22T09:36:10.372Z | 8776882 | 1.060 | 1.180 | 2405:201:25:d0aa:11b4:2e1c:9999:f32a | Tennis | pinka2 | Danielle Rose Collins | Match Odds | Saisai Zheng v Danielle Rose Collins | 1.050 | WINNER_DECLARED | 8776882 | ['Airoli', 'Maharashtra', 'IN', '19.1167,72.9833', 'AS55836 Reliance Jio Infocomm Limited', '400... | Airoli | Maharashtra | IN | 19.1167,72.9833 | AS55836 Reliance Jio Infocomm Limited | 400701 | Asia/Kolkata | 19.1167 | 72.9833 |
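`str.split` returns strings, so the new Latitude and Longitude columns are text rather than numbers at this point. If they are to be used as numerical features or plotted on a map, a conversion along the following lines may be needed (a sketch on toy data, not the notebook's own step):

```python
import pandas as pd

toy = pd.DataFrame({"loc": ["28.6519,77.2315", "19.0728,72.8826", None]})

# str.split returns strings; missing `loc` values stay missing.
lat_lon = toy["loc"].str.split(",", n=1, expand=True)
lat_lon.columns = ["Latitude", "Longitude"]

# Convert to floats; errors="coerce" turns anything unparsable into NaN.
lat_lon = lat_lon.apply(pd.to_numeric, errors="coerce")

print(lat_lon.dtypes)
```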
missing_values_table(merged_data)
Dataframe has 26 columns. There are 11 columns that have missing values.
| Missing Values | % of Total Values | |
|---|---|---|
| postal | 714 | 7.100 |
| Details | 6 | 0.100 |
| city | 6 | 0.100 |
| region | 6 | 0.100 |
| country | 6 | 0.100 |
| loc | 6 | 0.100 |
| org | 6 | 0.100 |
| timezone | 6 | 0.100 |
| Latitude | 6 | 0.100 |
| Longitude | 6 | 0.100 |
| IP | 3 | 0.000 |
for i in ['city', 'region', 'country', 'loc', 'org', 'postal', 'timezone']:
    if merged_data[i].nunique() < 10:
        count_plot(merged_data, i, size=(6,5))
    else:
        count_plot(merged_data, i )
display_image('world.png', width=800)
display_image('ip_india.png', width=800)
crosstab_by_y_table(merged_data, 'city')
---------------------------------------------- Top 10 city by WINNER_DECLARED and INVALID_BET ----------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| city | |
| Mumbai | 2509 |
| Delhi | 1682 |
| Airoli | 982 |
| Ludhiāna | 716 |
| Dubai | 685 |
| Jaipur | 419 |
| Surat | 392 |
| Gurgaon | 292 |
| Ahmedabad | 283 |
| Kolkata | 190 |
| status | INVALID_BET |
|---|---|
| city | |
| Delhi | 16 |
| Jaipur | 8 |
| Ludhiāna | 6 |
| Dubai | 6 |
| Hyderabad | 4 |
| Mumbai | 3 |
| Airoli | 3 |
| Guntur | 2 |
| Mohali | 2 |
| Bengaluru | 2 |
merged_data['eventType'].value_counts()
Cricket    6788
Soccer     1885
Tennis     1393
Name: eventType, dtype: int64
count_plot(merged_data, 'eventType', size=(6,5))
crosstab_by_y_plot(merged_data, 'eventType', figsize=(6,5))
--------------------------------- eventType grouped by status Count ---------------------------------
| status | INVALID_BET | WINNER_DECLARED |
|---|---|---|
| eventType | ||
| Cricket | 32 | 6756 |
| Soccer | 30 | 1855 |
| Tennis | 4 | 1389 |
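Raw counts make Cricket and Soccer look similar (32 vs. 30 invalid bets), but the groups differ a lot in size. A short sketch that turns the counts from the table above into per-group invalid rates:

```python
import pandas as pd

# Counts copied from the eventType-by-status table above.
counts = pd.DataFrame(
    {"INVALID_BET": [32, 30, 4], "WINNER_DECLARED": [6756, 1855, 1389]},
    index=["Cricket", "Soccer", "Tennis"],
)

# Share of invalid bets within each event type, in percent.
invalid_rate = (counts["INVALID_BET"] / counts.sum(axis=1) * 100).round(3)
print(invalid_rate.sort_values(ascending=False))
```

On the full frame, the same rates should be obtainable with `pd.crosstab(merged_data['eventType'], merged_data['status'], normalize='index')`. Soccer bets are invalid roughly three times as often as Cricket bets, which the raw counts alone do not show.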
merged_data['userName'].nunique()
410
count_plot(merged_data, 'userName')
crosstab_by_y_table(merged_data,'userName', top=20)
-------------------------------------------------- Top 20 userName by WINNER_DECLARED and INVALID_BET --------------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| userName | |
| soni111 | 489 |
| ab15 | 424 |
| aksash111 | 424 |
| angel18 | 307 |
| drgplay12 | 182 |
| mp111 | 177 |
| rock115 | 163 |
| rk50 | 156 |
| anikakan0101 | 155 |
| springplay12 | 155 |
| ab12 | 153 |
| ss14ss | 148 |
| base04 | 137 |
| badal101 | 134 |
| brovinn | 131 |
| pinka2 | 123 |
| rk05 | 119 |
| jeet7 | 119 |
| joy111 | 118 |
| pk | 103 |
| status | INVALID_BET |
|---|---|
| userName | |
| broraj1 | 3 |
| jk100p | 3 |
| ludo02 | 3 |
| ashu61 | 3 |
| chomo111 | 2 |
| soni111 | 2 |
| bhush001 | 2 |
| sony453 | 2 |
| raja90 | 2 |
| ri786 | 2 |
| goa07 | 2 |
| pd | 2 |
| rok100 | 2 |
| rohan21ss | 2 |
| jai010 | 2 |
| ankush | 2 |
| sony190 | 2 |
| ask09 | 1 |
| ss16ss | 1 |
| sony501 | 1 |
word_list = []
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
for val in merged_data.userName:
    val = str(val).strip().replace(" ", "_")
    word_list.append(val)
# Convert each token to lowercase
for i in range(len(word_list)):
    word_list[i] = word_list[i].lower()
wc = ' '.join(str(e) for e in word_list)
wordcloud = WordCloud(width = 800, height = 800,
background_color ='black',
min_font_size = 10).generate(wc)
# plot the WordCloud image
plt.figure(figsize = (6, 6), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.title("WordCloud for UserName")
plt.show()
merged_data.selectionName.nunique()
318
count_plot(merged_data, 'selectionName')
crosstab_by_y_table(merged_data, 'selectionName', top=20)
------------------------------------------------------- Top 20 selectionName by WINNER_DECLARED and INVALID_BET -------------------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| selectionName | |
| Dolphins | 1228 |
| Islamabad United | 751 |
| Titans | 717 |
| Lions | 573 |
| Lahore Qalandars | 531 |
| Karachi Kings | 412 |
| Australia | 404 |
| Leeward Islands | 369 |
| New Zealand | 343 |
| Guyana | 284 |
| Trinidad & Tobago | 273 |
| The Draw | 267 |
| No | 189 |
| Barbados | 187 |
| Novak Djokovic | 171 |
| Over 2.5 Goals | 163 |
| Under 2.5 Goals | 148 |
| Multan Sultans | 138 |
| Cape Cobras | 120 |
| Jamaica | 116 |
| status | INVALID_BET |
|---|---|
| selectionName | |
| West Indies | 13 |
| Over 1.5 Goals | 7 |
| The Draw | 6 |
| Canterbury | 3 |
| Hugo Dellien | 3 |
| Sunrisers Hyderabad | 3 |
| Delhi Capitals | 3 |
| Werder Bremen | 2 |
| Pakistan | 2 |
| Over 2.5 Goals | 2 |
| Lions | 2 |
| Northern Knights | 1 |
| Sivasspor | 1 |
| Northeast United | 1 |
| Boavista | 1 |
| Over 7.5 Goals | 1 |
| Granada | 1 |
| Under 2.5 Goals | 1 |
| Reims | 1 |
| Sociedad | 1 |
merged_data['marketName'].unique()
array(['Match Odds', 'Tied Match', 'Over/Under 2.5 Goals',
'Over/Under 3.5 Goals', 'Over/Under 1.5 Goals',
'Over/Under 4.5 Goals', 'Over/Under 5.5 Goals',
'Over/Under 0.5 Goals', 'Over/Under 6.5 Goals',
'Over/Under 7.5 Goals'], dtype=object)
count_plot(merged_data, 'marketName', rotation=20)
crosstab_by_y_plot(merged_data, 'marketName', stacked=False, rotation=90, figsize=(11,5.5))
---------------------------------- marketName grouped by status Count ----------------------------------
| status | INVALID_BET | WINNER_DECLARED |
|---|---|---|
| marketName | ||
| Match Odds | 55 | 9127 |
| Over/Under 0.5 Goals | 0 | 62 |
| Over/Under 1.5 Goals | 7 | 61 |
| Over/Under 2.5 Goals | 3 | 311 |
| Over/Under 3.5 Goals | 0 | 130 |
| Over/Under 4.5 Goals | 0 | 29 |
| Over/Under 5.5 Goals | 0 | 19 |
| Over/Under 6.5 Goals | 0 | 2 |
| Over/Under 7.5 Goals | 1 | 0 |
| Tied Match | 0 | 259 |
merged_data['event'].nunique()
245
count_plot(merged_data, 'event')
crosstab_by_y_table(merged_data, 'event')
----------------------------------------------- Top 10 event by WINNER_DECLARED and INVALID_BET -----------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| event | |
| Islamabad United v Multan Sultans | 967 |
| New Zealand v Australia (1st T20) | 766 |
| Lions v Warriors | 612 |
| Lahore Qalandars v Peshawar Zalmi | 561 |
| Guyana v Trinidad & Tobago | 557 |
| Dolphins v Cape Cobras | 553 |
| Leeward Islands v Jamaica | 470 |
| Warriors v Dolphins | 464 |
| Karachi Kings v Quetta Gladiators | 427 |
| Titans v Knights | 412 |
| status | INVALID_BET |
|---|---|
| event | |
| Bangladesh v West Indies | 14 |
| Pakistan v South Africa | 5 |
| Hapoel Beer Sheva v Nice | 5 |
| Delhi Capitals v Royal Challengers Bangalore | 3 |
| Sunrisers Hyderabad v Mumbai Indians | 3 |
| Central Districts v Canterbury | 3 |
| Casanova v Dellien | 3 |
| US Cremonese v Brescia | 2 |
| Lions v Warriors | 2 |
| Werder Bremen v Schalke 04 | 2 |
merged_data['winnerId'].nunique()
195
count_plot(merged_data, 'winnerId')
crosstab_by_y_table(merged_data, 'winnerId')
-------------------------------------------------- Top 10 winnerId by WINNER_DECLARED and INVALID_BET --------------------------------------------------
| status | WINNER_DECLARED |
|---|---|
| winnerId | |
| 4294273 | 1329 |
| 10782589 | 889 |
| 1254317 | 758 |
| 448 | 747 |
| 2312313 | 672 |
| 4294272 | 587 |
| 2810072 | 557 |
| 10782634 | 538 |
| 58805 | 427 |
| 10782635 | 420 |
| status | INVALID_BET |
|---|---|
| winnerId | |
| 235 | 14 |
| 58805 | 9 |
| 7461 | 5 |
| 1221385 | 5 |
| 22121561 | 3 |
| 86363 | 3 |
| 8226987 | 3 |
| 7671296 | 3 |
| 4294272 | 2 |
| 1221386 | 2 |
describe(merged_data['stake'])
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| stake | 10066.000 | 31709.479 | 145600.465 | 9.000 | 400.000 | 2000.000 | 20000.000 | 9300000.000 | 44194.691 | 31.301 | 1723.736 |
pdf_cdf(merged_data, 'stake', bins=20)
box_violin_plot(merged_data, 'stake')
%%time
sns.histplot(data=merged_data, x='stake', hue='status', kde=True).set_title("Histogram - stake")
plt.show()
Wall time: 28.1 s
count_plot(merged_data, 'stake', top=15, rotation=0)
stake_count = merged_data['stake'].value_counts()
stake_count_idx = merged_data['stake'].value_counts().sort_index()
# stake_count_df = pd.DataFrame({'value':stake_count.index, 'count':stake_count.values})
# stake_count_idx_df = pd.DataFrame({'value':stake_count_idx.index, 'count':stake_count_idx.values})
plt.figure(figsize=(14,5))
ax= merged_data['stake'].value_counts().sort_index().plot(kind='bar', color= mat_color_list)
axlist = []
for idx, val in zip(stake_count.sort_index().index, stake_count.sort_index().values):
    if val > 100:
        axlist.append(idx)
    else:
        axlist.append(None)
for p in ax.patches:
    if (p.get_height()<100):
        continue
    pat = str(p.get_height())
    ax.annotate(pat, (p.get_x() * 1.005, p.get_height() * 1.005))
ax.set_xticklabels(axlist)
ax.tick_params(axis ='x', rotation = 70)
plt.title("Top stakes (repeated more than 100 times)")
plt.xlabel("stake")
plt.ylabel("count")
plt.show()
Let's analyze the stake distribution further using percentiles.
# Calculate the 0th to 100th percentiles (in steps of 10) to find a suitable cutoff for outlier removal
var = np.sort(merged_data["stake"].values, axis=None)
for i in range(0,100,10):
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print ("100 percentile value is ",var[-1])
0 percentile value is 9.0
10 percentile value is 135.0
20 percentile value is 250.0
30 percentile value is 500.0
40 percentile value is 800.0
50 percentile value is 2000.0
60 percentile value is 10000.0
70 percentile value is 10000.0
80 percentile value is 25000.0
90 percentile value is 50000.0
100 percentile value is 9300000.0
# Look more closely at the 90th to 100th percentiles
var = np.sort(merged_data["stake"].values, axis=None)
for i in range(90,100):
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print ("100 percentile value is ",var[-1])
90 percentile value is 50000.0
91 percentile value is 60000.0
92 percentile value is 100000.0
93 percentile value is 100000.0
94 percentile value is 100000.0
95 percentile value is 113000.0
96 percentile value is 200000.0
97 percentile value is 200000.0
98 percentile value is 300000.0
99 percentile value is 500000.0
100 percentile value is 9300000.0
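The two loops above read each percentile by indexing the sorted array at position int(n * p / 100). A minimal sketch of that rule as a reusable function follows. The name `percentile_table` and the `demo_stakes` values are hypothetical and are not part of the notebook or its helper module.

```python
import numpy as np

def percentile_table(values, percentiles):
    """Map each percentile p to the sorted value at index int(n * p / 100)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    table = {}
    for p in percentiles:
        # Clamp the index so that p = 100 returns the maximum instead of overflowing.
        idx = min(int(n * p / 100), n - 1)
        table[p] = float(ordered[idx])
    return table

# Hypothetical stakes, for illustration only (not taken from the dataset).
demo_stakes = [9, 100, 250, 500, 800, 2000, 10000, 25000, 50000, 9300000]
demo_table = percentile_table(demo_stakes, range(0, 101, 10))
for p, value in demo_table.items():
    print("{} percentile value is {}".format(p, value))
```

Note that this index rule picks an actual observation and does not interpolate, so it can differ slightly from `np.percentile` with its default settings.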
# KDE (estimated PDF) of stake
sns.FacetGrid(merged_data,height=4) \
.map(sns.kdeplot,"stake") \
.add_legend()
plt.title("Distribution of stake")
plt.show();
# Convert stake to log values to check whether it is log-normally distributed
import math
merged_data['stake_log']=[math.log(i) for i in merged_data['stake'].values]
describe(merged_data[['stake','stake_log']]).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| stake | 10066.000 | 31709.479 | 145600.465 | 9.000 | 400.000 | 2000.000 | 20000.000 | 9300000.000 | 44194.691 | 31.301 | 1723.736 |
| stake_log | 10066.000 | 7.854 | 2.419 | 2.197 | 5.991 | 7.601 | 9.903 | 16.046 | 2.118 | 0.204 | -0.916 |
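The table above shows the skew of stake dropping from 31.301 to 0.204 after the log transform. The sketch below reproduces the same effect on synthetic data. The log-normal parameters are assumptions chosen only to resemble a heavy-tailed stake column; they are not fitted to the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed "stake" values drawn from a log-normal distribution.
demo = pd.DataFrame({"stake": rng.lognormal(mean=7.8, sigma=2.4, size=5000)})
demo["stake_log"] = np.log(demo["stake"])

skew_raw = demo["stake"].skew()
skew_log = demo["stake_log"].skew()
print("skew before log: {:.3f}".format(skew_raw))
print("skew after log:  {:.3f}".format(skew_log))
```

If a variable is log-normal, its logarithm is normal by definition, so a skew near zero after the transform is what the Q-Q plot check looks for.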
pdf_cdf(merged_data, 'stake_log')
box_violin_plot(merged_data, 'stake_log')
# KDE (estimated PDF) of stake_log, split by status
sns.FacetGrid(merged_data,height=4) \
.map(sns.kdeplot,"stake_log",hue=merged_data['status']) \
.add_legend()
plt.title("Distribution of stake log")
plt.show();
print(qqPlot.__doc__)
a Q–Q (Quantile-Quantile) plot is a probability plot,
which is a graphical method for comparing two probability distributions
by plotting their quantiles against each other.
use : qqPlot (df, feat)
df == dataframe name
feature == feature name
qqPlot(merged_data, 'stake_log')
---------------------------------------------------------------------------------- Plot QQ Plot to check the stake_log distribution is similar to normal distribution ----------------------------------------------------------------------------------
# Box-Cox transform stake (fitted on the full merged data) and save the lambda value
fitted_data_stake, fitted_lambda_stake = stats.boxcox(merged_data['stake'] + 1) # Add 1 to be able to transform 0 values
fitted_data_stake
array([5.39660167, 6.43430899, 8.07135899, ..., 7.48761883, 8.90847766,
7.02920276])
fitted_lambda_stake
-0.0466345079709477
merged_data['stake']
0 500.000
1 2100.000
2 25000.000
3 300.000
4 500.000
...
10061 500000.000
10062 500000.000
10063 10000.000
10064 100000.000
10065 5000.000
Name: stake, Length: 10066, dtype: float64
inv_boxcox(fitted_data_stake, fitted_lambda_stake)-1
array([ 500., 2100., 25000., ..., 10000., 100000., 5000.])
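The round trip above relies on the Box-Cox formula y = (x^λ - 1) / λ for λ ≠ 0 (and y = ln x for λ = 0). The sketch below checks, on a handful of illustrative stake values, that `scipy.stats.boxcox` agrees with the formula and that `inv_boxcox` recovers the input. The sample array is an assumption for illustration, so the fitted λ will not match the notebook's value.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Illustrative stakes only; the fitted lambda depends on the data supplied.
stakes = np.array([500.0, 2100.0, 25000.0, 300.0, 10000.0, 100000.0, 5000.0])
shifted = stakes + 1  # same +1 shift as in the notebook

# With no lambda given, stats.boxcox fits lambda by maximum likelihood.
transformed, lam = stats.boxcox(shifted)

# Apply the Box-Cox formula by hand for comparison.
if lam != 0:
    manual = (shifted ** lam - 1) / lam
else:
    manual = np.log(shifted)

# Invert the transform and undo the +1 shift.
recovered = inv_boxcox(transformed, lam) - 1
print("fitted lambda:", lam)
print("recovered stakes:", np.round(recovered, 3))
```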
merged_data['stake_boxcox'] = fitted_data_stake
describe(merged_data[['stake','stake_log','stake_boxcox']]).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| stake | 10066.000 | 31709.479 | 145600.465 | 9.000 | 400.000 | 2000.000 | 20000.000 | 9300000.000 | 44194.691 | 31.301 | 1723.736 |
| stake_log | 10066.000 | 7.854 | 2.419 | 2.197 | 5.991 | 7.601 | 9.903 | 16.046 | 2.118 | 0.204 | -0.916 |
| stake_boxcox | 10066.000 | 6.485 | 1.665 | 2.183 | 5.229 | 6.400 | 7.931 | 11.297 | 1.462 | 0.038 | -0.998 |
pdf_cdf(merged_data, 'stake_boxcox')
box_violin_plot(merged_data, 'stake_boxcox')
histplot(merged_data, 'stake_boxcox')
qqPlot(merged_data, 'stake_boxcox')
------------------------------------------------------------------------------------- Plot QQ Plot to check the stake_boxcox distribution is similar to normal distribution -------------------------------------------------------------------------------------
describe(merged_data['betRate'])
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| betRate | 10066.000 | 2.452 | 13.855 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.951 | 44.571 | 2835.808 |
merged_data['betRate'].nunique()
258
box_violin_plot(merged_data, 'betRate')
count_plot(merged_data, 'betRate')
merged_data['betRate_log'] = np.log(merged_data['betRate'])
describe(merged_data[['betRate','betRate_log']]).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| betRate | 10066.000 | 2.452 | 13.855 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.951 | 44.571 | 2835.808 |
| betRate_log | 10066.000 | 0.425 | 0.515 | 0.010 | 0.166 | 0.336 | 0.513 | 6.908 | 0.270 | 4.969 | 33.523 |
histplot(merged_data, 'betRate_log')
qqPlot(merged_data, 'betRate_log')
------------------------------------------------------------------------------------ Plot QQ Plot to check the betRate_log distribution is similar to normal distribution ------------------------------------------------------------------------------------
# Box-Cox transform betRate (fitted on the full merged data) and save the lambda value
fitted_data_betRate, fitted_lambda_betRate = stats.boxcox(merged_data['betRate'])
# fitted_data_betRate
# fitted_lambda_betRate
merged_data['betRate_boxcox'] = fitted_data_betRate
describe(merged_data[['betRate', 'betRate_log', 'betRate_boxcox']]).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| betRate | 10066.000 | 2.452 | 13.855 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.951 | 44.571 | 2835.808 |
| betRate_log | 10066.000 | 0.425 | 0.515 | 0.010 | 0.166 | 0.336 | 0.513 | 6.908 | 0.270 | 4.969 | 33.523 |
| betRate_boxcox | 10066.000 | 0.251 | 0.137 | 0.010 | 0.145 | 0.257 | 0.345 | 0.600 | 0.111 | 0.246 | -0.350 |
box_violin_plot(merged_data, 'betRate_boxcox')
histplot(merged_data,'betRate_boxcox', hue='status', kde=True)
qqPlot(merged_data, 'betRate_boxcox')
--------------------------------------------------------------------------------------- Plot QQ Plot to check the betRate_boxcox distribution is similar to normal distribution ---------------------------------------------------------------------------------------
merged_data['averagePriceMatched'].nunique()
726
describe(merged_data['averagePriceMatched'])
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| averagePriceMatched | 10066.000 | 2.466 | 13.969 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.977 | 43.882 | 2754.245 |
pdf_cdf(merged_data,'averagePriceMatched')
%%time
histplot(merged_data, 'averagePriceMatched')
Wall time: 2min 13s
merged_data['averagePriceMatched_log'] = np.log(merged_data['averagePriceMatched'])
describe(merged_data[['averagePriceMatched','averagePriceMatched_log']]).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| averagePriceMatched | 10066.000 | 2.466 | 13.969 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.977 | 43.882 | 2754.245 |
| averagePriceMatched_log | 10066.000 | 0.426 | 0.518 | 0.010 | 0.166 | 0.336 | 0.513 | 6.908 | 0.272 | 4.952 | 33.256 |
pdf_cdf(merged_data, 'averagePriceMatched_log')
histplot(merged_data, 'averagePriceMatched_log')
qqPlot(merged_data, 'averagePriceMatched_log')
------------------------------------------------------------------------------------------------ Plot QQ Plot to check the averagePriceMatched_log distribution is similar to normal distribution ------------------------------------------------------------------------------------------------
# Box-Cox transform averagePriceMatched (fitted on the full merged data) and save the lambda value
fitted_data_averagePriceMatched, fitted_lambda_averagePriceMatched = stats.boxcox(merged_data['averagePriceMatched'])
# fitted_data_averagePriceMatched
# fitted_lambda_averagePriceMatched
merged_data['averagePriceMatched_boxcox'] = fitted_data_averagePriceMatched
describe(merged_data[['averagePriceMatched', 'averagePriceMatched_log','averagePriceMatched_boxcox']]).T
| count | mean | std | min | 25% | 50% | 75% | max | mad | skew | kurt | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| averagePriceMatched | 10066.000 | 2.466 | 13.969 | 1.010 | 1.180 | 1.400 | 1.670 | 1000.000 | 1.977 | 43.882 | 2754.245 |
| averagePriceMatched_log | 10066.000 | 0.426 | 0.518 | 0.010 | 0.166 | 0.336 | 0.513 | 6.908 | 0.272 | 4.952 | 33.256 |
| averagePriceMatched_boxcox | 10066.000 | 0.251 | 0.137 | 0.010 | 0.145 | 0.257 | 0.345 | 0.599 | 0.112 | 0.249 | -0.352 |
pdf_cdf(merged_data, 'averagePriceMatched_boxcox')
box_violin_plot(merged_data, 'averagePriceMatched_boxcox')
histplot(merged_data, 'averagePriceMatched_boxcox')
qqPlot(merged_data, 'averagePriceMatched_boxcox')
--------------------------------------------------------------------------------------------------- Plot QQ Plot to check the averagePriceMatched_boxcox distribution is similar to normal distribution ---------------------------------------------------------------------------------------------------
merged_data['averagePriceMatched_boxcox'].plot.kde()
<AxesSubplot:ylabel='Density'>
dict_lambda_boxcox = {}
dict_lambda_boxcox['fitted_lambda_stake'] = fitted_lambda_stake
dict_lambda_boxcox['fitted_lambda_betRate'] = fitted_lambda_betRate
dict_lambda_boxcox['fitted_lambda_averagePriceMatched'] = fitted_lambda_averagePriceMatched
dict_lambda_boxcox
{'fitted_lambda_stake': -0.0466345079709477,
'fitted_lambda_betRate': -1.6678009857211833,
'fitted_lambda_averagePriceMatched': -1.668065197571059}
import json
with open('dict_lambda_boxcox.json', 'w') as fp:
json.dump(dict_lambda_boxcox, fp)
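Saving the fitted lambdas makes it possible to apply the same transform to new bets later without refitting. A minimal sketch of that reuse is shown below, assuming the JSON layout written above. It writes to a temporary directory so it does not touch the real file, and it uses only the stake lambda printed earlier.

```python
import json
import os
import tempfile

import numpy as np
from scipy.special import boxcox

# Lambda value copied from the dictionary printed above.
lambdas = {"fitted_lambda_stake": -0.0466345079709477}

# Write and reload the file, as a later scoring script might do.
path = os.path.join(tempfile.mkdtemp(), "dict_lambda_boxcox.json")
with open(path, "w") as fp:
    json.dump(lambdas, fp)
with open(path) as fp:
    loaded = json.load(fp)

# Apply the stored lambda to stakes, using the same +1 shift as at fit time.
new_stakes = np.array([500.0, 2100.0, 25000.0])
new_stake_boxcox = boxcox(new_stakes + 1, loaded["fitted_lambda_stake"])
print(new_stake_boxcox)
```

The three inputs are the first three stakes in the dataset, so the result should agree with the first three entries of `fitted_data_stake` shown earlier.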
merged_data['placedDate']
0 2021-02-22T09:43:04.686Z
1 2021-02-22T09:46:05.458Z
2 2021-02-22T09:36:10.372Z
3 2021-02-22T09:35:32.772Z
4 2021-02-22T09:34:47.648Z
...
10061 2021-02-18T15:39:37.308Z
10062 2021-02-18T15:39:47.201Z
10063 2021-02-18T18:20:17.016Z
10064 2021-02-18T21:01:38.393Z
10065 2021-02-19T21:55:36.789Z
Name: placedDate, Length: 10066, dtype: object
merged_data['placedDate'].dtype
dtype('O')
now = pd.Timestamp('now')
merged_data['placedDate'] = pd.to_datetime(merged_data['placedDate'], format='%Y-%m-%dT%H:%M:%S.%fZ')
merged_data['placedDate'].dtype
dtype('<M8[ns]')
merged_data['placedDate']
0 2021-02-22 09:43:04.686
1 2021-02-22 09:46:05.458
2 2021-02-22 09:36:10.372
3 2021-02-22 09:35:32.772
4 2021-02-22 09:34:47.648
...
10061 2021-02-18 15:39:37.308
10062 2021-02-18 15:39:47.201
10063 2021-02-18 18:20:17.016
10064 2021-02-18 21:01:38.393
10065 2021-02-19 21:55:36.789
Name: placedDate, Length: 10066, dtype: datetime64[ns]
merged_data['placedDate'].nunique()
10063
merged_data['Date'] = merged_data['placedDate'].dt.date
merged_data['Date'].nunique()
27
plt.figure(figsize=(13,4))
ax = sns.scatterplot(data=merged_data, x="Date", y="status", hue='status', style='status')
plt.xticks(list(merged_data['Date'].value_counts().sort_index().index))
ax.tick_params(axis ='x', rotation = 80)
# ax.set_xticklabels(x_ticks, rotation=0, fontsize=8)
# ax.set_yticklabels(y_ticks, rotation=0, fontsize=8)
plt.title("Status by Date")
plt.xlabel("Date")
plt.ylabel("status")
# plt.tight_layout()
plt.show()
# merged_data[merged_data['status']=='WINNER_DECLARED']['Date'].value_counts().sort_index()
# merged_data[merged_data['status']=='INVALID_BET']['Date'].value_counts().sort_index()
datei = pd.crosstab(merged_data['Date'],merged_data['status'])
datei.sort_values('WINNER_DECLARED', ascending=False).sort_index().T
| Date | 2020-11-02 | 2020-11-03 | 2020-11-05 | 2020-11-08 | 2020-12-01 | 2020-12-07 | 2020-12-08 | 2020-12-09 | 2020-12-10 | 2021-01-22 | 2021-01-27 | 2021-01-28 | 2021-01-30 | 2021-02-04 | 2021-02-05 | 2021-02-06 | 2021-02-08 | 2021-02-11 | 2021-02-12 | 2021-02-13 | 2021-02-14 | 2021-02-15 | 2021-02-18 | 2021-02-19 | 2021-02-20 | 2021-02-21 | 2021-02-22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| status | |||||||||||||||||||||||||||
| INVALID_BET | 3 | 3 | 1 | 1 | 1 | 1 | 2 | 2 | 6 | 1 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 1 | 1 | 16 | 2 | 1 | 5 | 1 | 0 | 0 | 0 |
| WINNER_DECLARED | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2221 | 3193 | 3671 | 915 |
plt.figure(figsize=(14,6))
ax = merged_data['Date'].value_counts().sort_index().plot(kind='bar')
add_value_labels(ax)
plt.title('Barplot Date')
plt.xlabel('Date')
plt.ylabel('Count')
plt.show()
merged_data.groupby(['Date'])['status'].value_counts().sort_index()
Date status
2020-11-02 INVALID_BET 3
2020-11-03 INVALID_BET 3
2020-11-05 INVALID_BET 1
2020-11-08 INVALID_BET 1
2020-12-01 INVALID_BET 1
2020-12-07 INVALID_BET 1
2020-12-08 INVALID_BET 2
2020-12-09 INVALID_BET 2
2020-12-10 INVALID_BET 6
2021-01-22 INVALID_BET 1
2021-01-27 INVALID_BET 1
2021-01-28 INVALID_BET 2
2021-01-30 INVALID_BET 3
2021-02-04 INVALID_BET 3
2021-02-05 INVALID_BET 3
2021-02-06 INVALID_BET 3
2021-02-08 INVALID_BET 3
2021-02-11 INVALID_BET 1
2021-02-12 INVALID_BET 1
2021-02-13 INVALID_BET 16
2021-02-14 INVALID_BET 2
2021-02-15 INVALID_BET 1
2021-02-18 INVALID_BET 5
2021-02-19 INVALID_BET 1
WINNER_DECLARED 2221
2021-02-20 WINNER_DECLARED 3193
2021-02-21 WINNER_DECLARED 3671
2021-02-22 WINNER_DECLARED 915
Name: status, dtype: int64
2021-02-19 INVALID_BET 1
WINNER_DECLARED 2221
2021-02-20 WINNER_DECLARED 3193
2021-02-21 WINNER_DECLARED 3671
2021-02-22 WINNER_DECLARED 915
Almost all bets (10,001 of 10,066) were placed on these four days, and every WINNER_DECLARED bet falls within them.
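The concentration of bets on those four days can be checked directly from the crosstab counts. The sketch below copies the per-day counts from the output above and pools every date before 2021-02-19 into one row; the pooled row label is only for illustration.

```python
import pandas as pd

# Counts copied from the crosstab above; dates before 2021-02-19 are pooled.
counts = pd.DataFrame(
    {
        "INVALID_BET": [65, 1, 0, 0, 0],
        "WINNER_DECLARED": [0, 2221, 3193, 3671, 915],
    },
    index=["before 2021-02-19", "2021-02-19", "2021-02-20", "2021-02-21", "2021-02-22"],
)

per_day = counts.sum(axis=1)
last_four = int(per_day.iloc[1:].sum())
total = int(per_day.sum())
share_last_four = last_four / total

# Share of INVALID_BET rows that fall before the first WINNER_DECLARED bet.
invalid_before = counts.loc["before 2021-02-19", "INVALID_BET"] / counts["INVALID_BET"].sum()

print("bets in last four days: {} of {} ({:.2%})".format(last_four, total, share_last_four))
print("invalid bets before 2021-02-19: {:.2%}".format(invalid_before))
```

Because the two classes barely overlap in time, any date-derived feature can separate them almost perfectly, which is worth keeping in mind when choosing features.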
merged_data['time'] = merged_data['placedDate'].dt.time
merged_data['time']
0 09:43:04.686000
1 09:46:05.458000
2 09:36:10.372000
3 09:35:32.772000
4 09:34:47.648000
...
10061 15:39:37.308000
10062 15:39:47.201000
10063 18:20:17.016000
10064 21:01:38.393000
10065 21:55:36.789000
Name: time, Length: 10066, dtype: object
merged_data['time'].nunique()
10061
merged_data['hour'] = merged_data['placedDate'].dt.hour
merged_data['hour'].nunique()
24
plt.figure(figsize=(14,6))
ax = merged_data['hour'].value_counts().sort_index().plot(kind='bar')
plt.xticks(np.arange(0, 25, 1))
add_value_labels(ax)
plt.xticks(rotation=0)
plt.title("Bar Plot for Hour")
plt.xlabel('hour')
plt.ylabel('Count')
plt.show()
crosstab_by_y_plot(merged_data, 'hour', transposed=True, figsize=(15,6), stacked=False, legend_out=False)
---------------------------- hour grouped by status Count ----------------------------
| hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| status | ||||||||||||||||||||||||
| INVALID_BET | 0 | 0 | 0 | 2 | 1 | 3 | 3 | 3 | 11 | 1 | 1 | 0 | 1 | 1 | 11 | 4 | 3 | 1 | 2 | 8 | 3 | 6 | 1 | 0 |
| WINNER_DECLARED | 8 | 20 | 13 | 14 | 52 | 118 | 302 | 226 | 809 | 798 | 705 | 439 | 595 | 592 | 816 | 960 | 785 | 770 | 622 | 461 | 434 | 300 | 107 | 54 |
merged_data['week_of_the_year'] = merged_data['placedDate'].dt.isocalendar().week
merged_data['week_of_the_year'].value_counts()
7 9092
8 915
6 23
50 11
5 9
45 8
4 6
49 1
3 1
Name: week_of_the_year, dtype: Int64
merged_data['week_of_the_year'].nunique()
9
count_plot(merged_data, 'week_of_the_year', sort_index=True)
crosstab_by_y_plot(merged_data, 'week_of_the_year', stacked=False, figsize=(10, 6),)
---------------------------------------- week_of_the_year grouped by status Count ----------------------------------------
| status | INVALID_BET | WINNER_DECLARED |
|---|---|---|
| week_of_the_year | ||
| 3 | 1 | 0 |
| 4 | 6 | 0 |
| 5 | 9 | 0 |
| 6 | 23 | 0 |
| 7 | 7 | 9085 |
| 8 | 0 | 915 |
| 45 | 8 | 0 |
| 49 | 1 | 0 |
| 50 | 11 | 0 |
merged_data['weekday'] = merged_data['placedDate'].dt.weekday
merged_data['weekday'].nunique()
7
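The date-part features above (hour, week_of_the_year, weekday) all come from the parsed placedDate column. The sketch below derives them for two timestamps taken from the placedDate output shown earlier; the small `demo` frame is a stand-in for `merged_data`.

```python
import pandas as pd

# Two timestamps copied from the placedDate column shown earlier.
demo = pd.DataFrame(
    {"placedDate": ["2021-02-22T09:43:04.686Z", "2021-02-19T21:55:36.789Z"]}
)
demo["placedDate"] = pd.to_datetime(demo["placedDate"], format="%Y-%m-%dT%H:%M:%S.%fZ")

parts = demo["placedDate"].dt
demo["hour"] = parts.hour
demo["weekday"] = parts.weekday                      # Monday = 0, Sunday = 6
demo["week_of_the_year"] = parts.isocalendar().week.astype(int)  # ISO week number

print(demo[["hour", "weekday", "week_of_the_year"]])
```

Since `weekday` counts from Monday = 0, the value 4 in the tables below corresponds to Friday.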
count_plot(merged_data, 'weekday', sort_index=True, size=(10, 5.8),)
crosstab_by_y_plot(merged_data, 'weekday', stacked=False, figsize=(10, 5.6),)
------------------------------- weekday grouped by status Count -------------------------------
| status | INVALID_BET | WINNER_DECLARED |
|---|---|---|
| weekday | ||
| 0 | 8 | 915 |
| 1 | 6 | 0 |
| 2 | 3 | 0 |
| 3 | 18 | 0 |
| 4 | 6 | 2221 |
| 5 | 22 | 3193 |
| 6 | 3 | 3671 |
display_image("timeseries.png",width=2000)
merged_data.to_csv('data_after_EDA.csv',index=False)
Compute pairwise correlation of columns, excluding NA/null values.
corr = merged_data.corr()
plt.figure(figsize=(14,7))
sns.heatmap(corr, cmap='Blues', annot=True, fmt='.2g',)
plt.show()
heading("Top 20 Absolute Correlations")
print(get_top_abs_correlations(corr, 20))
----------------------------
Top 20 Absolute Correlations
----------------------------
betRate_log averagePriceMatched_log 1.000
betRate_boxcox averagePriceMatched_boxcox 1.000
betRate averagePriceMatched 1.000
stake_log stake_boxcox 1.000
horse winnerId 0.997
betRate_boxcox averagePriceMatched_log 0.909
averagePriceMatched_log averagePriceMatched_boxcox 0.909
betRate_log betRate_boxcox 0.909
averagePriceMatched_boxcox 0.909
horse marketId 0.748
marketId winnerId 0.747
averagePriceMatched betRate_log 0.719
averagePriceMatched_log 0.718
betRate betRate_log 0.718
averagePriceMatched_log 0.717
stake stake_log 0.655
stake_boxcox 0.641
hour weekday 0.610
marketId weekday 0.519
winnerId betRate_log 0.499
dtype: float64
betRate and averagePriceMatched have a correlation of 1.000 (to three decimals).
winnerId is highly correlated with horse (0.997).
Log-transformed and Box-Cox-transformed features are highly correlated with their original feature, so only one version of each should be included in modeling.
betRate and its transforms (log and Box-Cox) are highly correlated with averagePriceMatched and its transforms (log and Box-Cox).
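The ranking above comes from the helper `get_top_abs_correlations`, whose source is not shown here. The sketch below is one possible way to produce such a ranking, not the helper's actual implementation: it takes each unordered column pair once and sorts by absolute correlation. The function name `top_abs_correlations` and the toy columns are hypothetical.

```python
import pandas as pd

def top_abs_correlations(corr, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    cols = list(corr.columns)
    # Visit each unordered pair once, skipping the diagonal.
    pairs = {
        (a, b): abs(corr.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
    }
    return pd.Series(pairs).sort_values(ascending=False).head(n)

# Toy data: x_double is an exact multiple of x, so the pair should rank first.
toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "x_double": [2, 4, 6, 8, 10],
        "noise": [2, 1, 4, 3, 5],
    }
)
top = top_abs_correlations(toy.corr(), n=3)
print(top)
```

Pairs that rank near 1.000 carry the same information, which is the reason for keeping only one column from each such group.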
features = list(merged_data.columns.values)
merged_data.nunique()
_id 10066
stake 376
type 2
placedDate 10063
horse 319
betRate 258
marketId 390
IP 1070
eventType 3
userName 410
selectionName 318
marketName 10
event 245
averagePriceMatched 726
status 2
winnerId 195
Details 222
city 85
region 30
country 9
loc 105
org 102
postal 103
timezone 10
Latitude 104
Longitude 105
stake_log 376
stake_boxcox 376
betRate_log 258
betRate_boxcox 258
averagePriceMatched_log 721
averagePriceMatched_boxcox 714
Date 27
time 10061
hour 24
week_of_the_year 9
weekday 7
dtype: int64
merged_data.columns.values
array(['_id', 'stake', 'type', 'placedDate', 'horse', 'betRate',
'marketId', 'IP', 'eventType', 'userName', 'selectionName',
'marketName', 'event', 'averagePriceMatched', 'status', 'winnerId',
'Details', 'city', 'region', 'country', 'loc', 'org', 'postal',
'timezone', 'Latitude', 'Longitude', 'stake_log', 'stake_boxcox',
'betRate_log', 'betRate_boxcox', 'averagePriceMatched_log',
'averagePriceMatched_boxcox', 'Date', 'time', 'hour',
'week_of_the_year', 'weekday'], dtype=object)
# numerical_fet = ['stake', 'stake_log', 'stake_boxcox', 'betRate', 'betRate_log', 'betRate_boxcox',
# 'marketId', 'averagePriceMatched', 'averagePriceMatched_log', 'averagePriceMatched_boxcox']
# categorical_fet = ['type', 'horse', 'eventType', 'userName', 'selectionName', 'marketName', 'event', 'winnerId',
# 'IP_version', 'city', 'region', 'country', 'org', 'timezone',
# 'Date', 'hour', 'week', 'weekday']
# dependent_fet = 'status'
# numerical_fet = ['stake_boxcox','betRate_boxcox','marketId', 'averagePriceMatched_boxcox']
# categorical_fet = ['type', 'eventType', 'marketName', 'event',
# 'IP_version', 'Date', 'hour', 'week', 'weekday']
# hash_fet = ['horse', 'userName', 'selectionName', 'winnerId',]
# dependent_fet = 'status'
| Feature Name | Included | Reason |
|---|---|---|
| _id | No | Unique identifier for bets |
| stake | No | stake_boxcox is used. |
| stake_log | No | stake_boxcox is used. |
| stake_boxcox | Yes | |
| type | Yes | |
| horse | Yes | |
| betRate | No | betRate_boxcox is used. |
| betRate_log | No | betRate_boxcox is used. |
| betRate_boxcox | Yes | |
| marketId | No | 390 category |
| eventType | Yes | |
| userName | No | 410 category |
| selectionName | No | 318 category |
| marketName | Yes | |
| event | Yes | |
| averagePriceMatched | No | averagePriceMatched_boxcox is used |
| averagePriceMatched_log | No | averagePriceMatched_boxcox is used |
| averagePriceMatched_boxcox | Yes | |
| winnerId | No | 195 category |
| IP | No | Unique identifier for IP |
| Details (IP) | No | list of details about IP. |
| city | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| region | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| country | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| loc | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| org | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| postal | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| timezone | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| Latitude | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| Longitude | No | Dropping here but can be useful for large data where bets have geographical correlation. |
| placedDate | No | Unique Identifier for date and time when bet is placed. |
| Date | No | EDA shows that all bets before 19 Feb 2021 are INVALID_BET, so a model could overfit by predicting from the date alone. |
| time | No | hour is used instead. |
| hour | Yes | |
| week_of_the_year | No | Can cause overfitting due to temporal nature of the data. |
| weekday | Yes | |
numerical_fet = ['stake_boxcox','betRate_boxcox','marketId', 'averagePriceMatched_boxcox']
categorical_fet = ['type', 'eventType', 'marketName', 'event',
'hour', 'week_of_the_year', 'weekday']
dependent_fet = 'status'
merged_data[numerical_fet].head()
| stake_boxcox | betRate_boxcox | marketId | averagePriceMatched_boxcox | |
|---|---|---|---|---|
| 0 | 5.397 | 0.019 | 1.180 | 0.019 |
| 1 | 6.434 | 0.010 | 1.180 | 0.010 |
| 2 | 8.071 | 0.056 | 1.180 | 0.047 |
| 3 | 5.011 | 0.064 | 1.180 | 0.064 |
| 4 | 5.397 | 0.072 | 1.180 | 0.072 |
merged_data[categorical_fet].head(3)
| type | eventType | marketName | event | hour | week_of_the_year | weekday | |
|---|---|---|---|---|---|---|---|
| 0 | BACK | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 |
| 1 | LAY | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 |
| 2 | LAY | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 |
missing_values_table(merged_data[categorical_fet])
Dataframe has 7 columns. There are 0 columns that have missing values.
| Missing Values | % of Total Values |
|---|
new_df = pd.concat([merged_data[numerical_fet],
merged_data[categorical_fet],
merged_data[dependent_fet]], axis=1)
new_df.head(3)
| stake_boxcox | betRate_boxcox | marketId | averagePriceMatched_boxcox | type | eventType | marketName | event | hour | week_of_the_year | weekday | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.397 | 0.019 | 1.180 | 0.019 | BACK | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 | WINNER_DECLARED |
| 1 | 6.434 | 0.010 | 1.180 | 0.010 | LAY | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 | WINNER_DECLARED |
| 2 | 8.071 | 0.056 | 1.180 | 0.047 | LAY | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 | WINNER_DECLARED |
missing_values_table(new_df)
Dataframe has 12 columns. There are 0 columns that have missing values.
| Missing Values | % of Total Values |
|---|
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10066 entries, 0 to 10065
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   stake_boxcox                10066 non-null  float64
 1   betRate_boxcox              10066 non-null  float64
 2   marketId                    10066 non-null  float64
 3   averagePriceMatched_boxcox  10066 non-null  float64
 4   type                        10066 non-null  object
 5   eventType                   10066 non-null  object
 6   marketName                  10066 non-null  object
 7   event                       10066 non-null  object
 8   hour                        10066 non-null  int64
 9   week_of_the_year            10066 non-null  UInt32
 10  weekday                     10066 non-null  int64
 11  status                      10066 non-null  object
dtypes: UInt32(1), float64(4), int64(2), object(5)
memory usage: 1.3+ MB
print(DataFrameImputer.__doc__)
Impute missing values.
Columns of dtype object are imputed with the most frequent value (mode)
in column.
Columns of other types are imputed with mean of column.
data = DataFrameImputer().fit_transform(new_df)
data.isnull().sum().any()
False
data.head(3)
| stake_boxcox | betRate_boxcox | marketId | averagePriceMatched_boxcox | type | eventType | marketName | event | hour | week_of_the_year | weekday | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.397 | 0.019 | 1.180 | 0.019 | BACK | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 | WINNER_DECLARED |
| 1 | 6.434 | 0.010 | 1.180 | 0.010 | LAY | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 | WINNER_DECLARED |
| 2 | 8.071 | 0.056 | 1.180 | 0.047 | LAY | Tennis | Match Odds | Saisai Zheng v Danielle Rose Collins | 9 | 8 | 0 | WINNER_DECLARED |
data.shape
(10066, 12)
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.3, stratify=data['status'], random_state=1)
print(train.shape)
print(test.shape)
(7046, 12)
(3020, 12)
train['status'].value_counts()
WINNER_DECLARED 7000
INVALID_BET 46
Name: status, dtype: int64
test['status'].value_counts()
WINNER_DECLARED 3000
INVALID_BET 20
Name: status, dtype: int64
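The counts above show that `stratify` keeps the INVALID_BET rate nearly equal in both splits despite the strong class imbalance. The sketch below repeats the split on synthetic labels with the same class counts (10,000 valid and 66 invalid). The placeholder feature column is an assumption for illustration and carries no information.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts: 0 = WINNER_DECLARED, 1 = INVALID_BET.
y = np.array([0] * 10000 + [1] * 66)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, for illustration only

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("train: {} rows, {} invalid ({:.4%})".format(len(y_tr), int(y_tr.sum()), train_rate))
print("test:  {} rows, {} invalid ({:.4%})".format(len(y_te), int(y_te.sum()), test_rate))
```

With only 20 positive rows in the test set, a single misclassified INVALID_BET moves recall by 5 percentage points, so metrics on the minority class should be read with that granularity in mind.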
def pie_labeling_train(x):
    print(x)
    return '{:.4f}%\n(#{:.0f})'.format(x, sums_train.values.sum()*x/100)
def pie_labeling_test(x):
    print(x)
    return '{:.4f}%\n(#{:.0f})'.format(x, sums_test.values.sum()*x/100)
fig = plt.figure(figsize=(24,10))
# Lay out a grid of 2 rows and 4 columns for the subplots;
# (0, 0) is the upper-left cell of the grid
ax1 = plt.subplot2grid((2,4),(0,0))
sums_train = train['status'].value_counts()
plt.pie(sums_train, labels=['WINNER_DECLARED', 'INVALID_BET'],autopct=pie_labeling_train, pctdistance=1.3, labeldistance=1.6)
plt.title('Train Dependent Variable ("status") distribution')
#next one
ax1 = plt.subplot2grid((2, 4), (0, 1))
sums_test = test['status'].value_counts()
plt.pie(sums_test, labels=['WINNER_DECLARED', 'INVALID_BET'],autopct=pie_labeling_test, pctdistance=1.3, labeldistance=1.6)
plt.title('Test Dependent Variable ("status") distribution')
plt.tight_layout()
plt.show()
99.34714436531067
0.652852701023221
99.33775067329407
0.6622516550123692
train = train.replace({'status': {'WINNER_DECLARED':0, 'INVALID_BET':1}})
test = test.replace({'status': {'WINNER_DECLARED':0, 'INVALID_BET':1}})
y_train = train['status'].copy()
X_train = train.drop("status",axis=1).copy()
print(X_train.shape)
print(y_train.shape)
(7046, 11)
(7046,)
y_test=test['status'].copy()
X_test=test.drop("status",axis=1).copy()
print(X_test.shape)
print(y_test.shape)
(3020, 11)
(3020,)
for i in train.columns:
    print("{:<40}{:>20}".format(i,train[i].nunique()))
stake_boxcox 315
betRate_boxcox 249
marketId 352
averagePriceMatched_boxcox 573
type 2
eventType 3
marketName 10
event 228
hour 24
week_of_the_year 7
weekday 7
status 2
X_train_numerical=X_train[numerical_fet].copy()
X_test_numerical=X_test[numerical_fet].copy()
from sklearn.preprocessing import StandardScaler,OneHotEncoder , LabelEncoder ,normalize
scaler = StandardScaler()
scaler.fit(X_train_numerical)
X_train_numerical_std = scaler.transform(X_train_numerical)
X_test_numerical_std = scaler.transform(X_test_numerical)
X_train_numerical_std=pd.DataFrame(X_train_numerical_std,columns=numerical_fet)
X_test_numerical_std=pd.DataFrame(X_test_numerical_std,columns=numerical_fet)
# Checking the values after converting
X_train_numerical_std.head()
| stake_boxcox | betRate_boxcox | marketId | averagePriceMatched_boxcox | |
|---|---|---|---|---|
| 0 | -1.401 | 0.313 | 0.778 | 0.386 |
| 1 | 0.122 | 1.089 | -0.105 | 1.087 |
| 2 | -0.048 | -1.193 | 0.723 | -1.189 |
| 3 | -0.652 | 1.025 | 0.087 | 1.023 |
| 4 | -0.968 | -1.431 | 0.781 | -1.427 |
print("Shape of Standardized X_train: ",X_train_numerical_std.shape)
print("Shape of Standardized X_test: ",X_test_numerical_std.shape)
Shape of Standardized X_train: (7046, 4)
Shape of Standardized X_test: (3020, 4)
import joblib
joblib.dump(scaler, 'scaler.joblib')
['scaler.joblib']
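The scaler above is fitted on the training split only and then reused for the test split, so the test data never influences the mean and standard deviation. A minimal sketch of that behaviour on toy numbers follows; the values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[1.0], [2.0], [3.0]])  # hypothetical training values
test_col = np.array([[4.0]])                 # hypothetical unseen value

# Learn mean and standard deviation from the training data only.
toy_scaler = StandardScaler().fit(train_col)

train_std = toy_scaler.transform(train_col)
test_std = toy_scaler.transform(test_col)   # reuses the training mean and std

print("training mean:", toy_scaler.mean_, "training std:", toy_scaler.scale_)
print("scaled test value:", test_std.ravel())
```

The scaled test value lies outside the range of the scaled training values, which is expected: the test row is measured against training statistics and is not re-centered on its own.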
X_train_categorical=X_train[categorical_fet].copy()
X_test_categorical=X_test[categorical_fet].copy()
X_train_categorical.head()
| type | eventType | marketName | event | hour | week_of_the_year | weekday | |
|---|---|---|---|---|---|---|---|
| 3134 | BACK | Cricket | Match Odds | Warriors v Dolphins | 13 | 7 | 6 |
| 9132 | LAY | Cricket | Match Odds | Dolphins v Cape Cobras | 13 | 7 | 4 |
| 7384 | LAY | Tennis | Match Odds | Brady v Osaka | 9 | 7 | 5 |
| 2392 | BACK | Soccer | Over/Under 3.5 Goals | Aston Villa v Leicester | 15 | 7 | 6 |
| 3869 | LAY | Tennis | Match Odds | Djokovic v Medvedev | 10 | 7 | 6 |
onehot_encoder = OneHotEncoder(sparse=False, handle_unknown = 'ignore')
onehot_encoder.fit(X_train_categorical)
X_train_categorical_encoded = onehot_encoder.transform(X_train_categorical)
X_test_categorical_encoded = onehot_encoder.transform(X_test_categorical)
# Checking the Encoded Data
X_train_categorical_encoded
array([[1., 0., 1., ..., 0., 0., 1.],
[0., 1., 1., ..., 1., 0., 0.],
[0., 1., 0., ..., 0., 1., 0.],
...,
[0., 1., 0., ..., 1., 0., 0.],
[0., 1., 0., ..., 0., 1., 0.],
[0., 1., 1., ..., 0., 0., 1.]])
print("X_train after One Hot Encoding: ",X_train_categorical_encoded.shape)
print("X_test after One Hot Encoding: ",X_test_categorical_encoded.shape)
X_train after One Hot Encoding: (7046, 281)
X_test after One Hot Encoding: (3020, 281)
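The 281 encoded columns equal the sum of the distinct training categories of the seven categorical features (2 + 3 + 10 + 228 + 24 + 7 + 7). The sketch below shows the same column arithmetic on a toy frame, and what `handle_unknown='ignore'` does with a category that appears only in the test split. The toy frames and the "Golf" category are hypothetical. The encoder's default sparse output is converted with `.toarray()`, which avoids the `sparse` argument that later scikit-learn releases renamed to `sparse_output`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"eventType": ["Cricket", "Tennis", "Soccer"], "hour": [9, 13, 15]})
toy_test = pd.DataFrame({"eventType": ["Cricket", "Golf"], "hour": [9, 23]})

# Fit on the training split only; unseen test categories are ignored, not errors.
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train)

encoded_train = toy_encoder.transform(toy_train).toarray()
encoded_test = toy_encoder.transform(toy_test).toarray()

expected_cols = int(toy_train.nunique().sum())  # 3 event types + 3 hours
print("encoded columns:", encoded_train.shape[1], "expected:", expected_cols)
print(encoded_test)
```

The second test row contains only unseen categories, so it is encoded as all zeros. Events that occur only in the test split are therefore invisible to the model.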
## Obtaining Feature Names from the Encoder
# Note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
encodedCatColumnNames = list(onehot_encoder.get_feature_names(categorical_fet))
X_train_categorical_encoded=pd.DataFrame(X_train_categorical_encoded,columns=encodedCatColumnNames)
X_test_categorical_encoded=pd.DataFrame(X_test_categorical_encoded,columns=encodedCatColumnNames)
X_train_categorical_encoded.head(3)
| type_BACK | type_LAY | eventType_Cricket | eventType_Soccer | eventType_Tennis | marketName_Match Odds | marketName_Over/Under 0.5 Goals | marketName_Over/Under 1.5 Goals | marketName_Over/Under 2.5 Goals | marketName_Over/Under 3.5 Goals | marketName_Over/Under 4.5 Goals | marketName_Over/Under 5.5 Goals | marketName_Over/Under 6.5 Goals | marketName_Over/Under 7.5 Goals | marketName_Tied Match | event_AAB v Midtjylland | event_AC Horsens v OB | event_AC Milan v Inter | event_Accrington v Shrewsbury | event_Adelaide United v Central Coast Mariners | event_Admira Wacker v LASK Linz | event_Ajax v Sparta Rotterdam | event_Albacete v Sporting Gijon | event_Altmaier v Martin | event_Amiens v Sochaux | event_Arminia Bielefeld v Wolfsburg | event_Arsenal v Man City | event_Ascoli v Salernitana | event_Aston Villa v Leicester | event_Atalanta v Napoli | event_Athletic Bilbao v Villarreal | event_Atletico Madrid v Levante | event_Augsburg v Leverkusen | event_Austria Vienna v SCR Altach | event_Bangladesh v West Indies | event_Barbados v Jamaica | event_Barcelona v Cadiz | event_Barrere v Halys | event_Basaksehir v Trabzonspor | event_Bayern Munich v Arminia Bielefeld | event_Belenenses v CD Nacional Funchal | event_Benevento v Roma | event_Betis v Getafe | event_Boavista v Moreirense | event_Brady v Osaka | event_Braga v Tondela | event_Braunschweig v Jahn Regensburg | event_Brescia v US Cremonese | event_Brest v Lyon | event_Brondby v Vejle | event_Burnley v West Brom | event_Cagliari v Torino | event_Cape Cobras v Titans | event_Cardiff v Preston | event_Casanova v Dellien | event_Central Districts v Canterbury | event_Cerundolo v Se Baez | event_Cerundolo v Th Seyboth Wild | event_Chambly Oise v Auxerre | event_Christina Mchale v Maddison Inglis | event_Cittadella v Reggiana | event_Clezar v Olivo | event_Cori Gauff v Kaja Juvan | event_Coria v Tabilo | event_Corinthians v Vasco da Gama | event_Coventry v Brentford | event_Daniel v Ramanathan | event_Delhi Capitals 
v Royal Challengers Bangalore | event_Denizlispor v Genclerbirligi | event_Djokovic v Medvedev | event_Dolphins v Cape Cobras | event_Doncaster v Hull | event_Donskoy v S Kwon | event_Dzumhur v Jacquet | event_E King v Celikbilek | event_Eibar v Valladolid | event_Eintracht Frankfurt v Bayern Munich | event_Elche v Eibar | event_Emmen v PEC Zwolle | event_Erzgebirge v Bochum | event_Erzurum BB v Hatayspor | event_Escobedo v Meligeni Rodrigues Alve | event_Espanyol v Sabadell | event_Eupen v KV Oostende | event_FC Koln v Stuttgart | event_FC Twente v Feyenoord | event_FC Voluntari v Arges Pitesti | event_FCSB v Chindia Targoviste | event_FK Krasnodar v Sochi | event_Farense v Benfica | event_Fenerbahce v Goztepe | event_Fiorentina v Spezia | event_Flamengo v Internacional | event_Fortuna Sittard v ADO Den Haag | event_Freiburg v Union Berlin | event_Frosinone v Pescara | event_Fulham v Sheff Utd | event_Gaio v Donskoy | event_Galle Gladiators v Kandy Tuskers | event_Galloway v Vla Orlov | event_Gaziantep FK v Goztepe | event_Genk v KFCO Beerschot Wilrijk | event_Genoa v Verona | event_Gil Vicente v Santa Clara | event_Gillingham v Bristol Rovers | event_Girona v CD Castellon | event_Guyana v Trinidad & Tobago | event_Hapoel Beer Sheva v Nice | event_Heerenveen v FC Groningen | event_Hertha Berlin v RB Leipzig | event_Hoang v Van Assche | event_Hoffenheim v Werder Bremen | event_Huddersfield v Swansea | event_Huesca v Granada | event_Ilkel v Brooksby | event_Ipswich v Oxford Utd | event_Islamabad United v Multan Sultans | event_Jaziri v J Smith | event_Jua Varillas v Descotte | event_Jua Varillas v Manuel Cerundolo | event_Jung v Cressy | event_Karachi Kings v Quetta Gladiators | event_Karlsruhe v Nurnberg | event_Kasimpasa v Fatih Karagumruk Istanbul | event_Kerala Blasters FC v Jamshedpur FC | event_Knights v Dolphins | event_Kuzmanov v Miedler | event_Kwiatkowski v Bambridge | event_L Broady v Menendez-Maceiras | event_L Harris v Adrian Andreev | event_L Mayer v 
To Etcheverry | event_LR Vicenza Virtus v Spal | event_Lahore Qalandars v Peshawar Zalmi | event_Las Palmas v FC Cartagena | event_Lazio v Sampdoria | event_Le Havre v Dunkerque | event_Leeward Islands v Jamaica | event_Lions v Warriors | event_Liverpool v Everton | event_Lorient v Lille | event_Lugo v UD Logrones | event_Macarthur FC v Western Sydney Wanderers | event_Madison Brengle v Ellen Perez | event_Majchrzak v Lacko | event_Malaga v Rayo Vallecano | event_Malatyaspor v Konyaspor | event_Mallorca v Almeria | event_Malmo FF v Vasteras SK | event_Man Utd v Newcastle | event_Manuel Cerundolo v Tenti | event_Marchenko v Seppi | event_Medvedev v Tsitsipas | event_Menezes v Gu Justo | event_Mgladbach v Mainz | event_Miedler v Brooksby | event_Millwall v Wycombe | event_Misaki Doi v Liudmila Samsonova | event_Montpellier v Rennes | event_Mumbai City FC v Northeast United | event_Musetti v Gulbis | event_Nancy v Grenoble | event_Nantes v Marseille | event_Napoli v Juventus | event_New Zealand v Australia (1st T20) | event_Nice v Metz | event_Nimes v Bordeaux | event_Niort v Pau | event_Norwich v Rotherham | event_Nottm Forest v Blackburn | event_Olympiakos v Aris | event_Ornago v Bega | event_Otago v Northern Knights | event_PAS Giannina v OFI | event_PSV v Vitesse Arnhem | event_Pacos Ferreira v Guimaraes | event_Paderborn v SV Sandhausen | event_Pakistan v South Africa | event_Paris FC v Chateauroux | event_Paris St-G v Monaco | event_Parma v Udinese | event_Pisa v Empoli | event_Porto v Boavista | event_Portsmouth v Blackpool | event_Preston v Middlesbrough | event_QPR v Bournemouth | event_RKC Waalwijk v Heracles | event_Randers v FC Nordsjaelland | event_Red Bull Salzburg v Rapid Vienna | event_Reggina v Pordenone | event_Reims v Lens | event_Rio Ave v Famalicao | event_Ro Hobbs v J Smith | event_Rochdale v Plymouth | event_Rodez v Toulouse | event_Ross Co v Celtic | event_Royal Mouscron-Peruwelz v Cercle Brugge | event_S Kwon v Maden | event_Saisai Zheng v 
Danielle Rose Collins | event_Sassuolo v Bologna | event_Schalke 04 v Dortmund | event_Se Baez v Martin | event_Seppi v Musetti | event_Sevilla v Getafe | event_Sheff Wed v Birmingham | event_Shelby Rogers v Veronika Kudermetova | event_Sivasspor v Antalyaspor | event_Sivasspor v Kayserispor | event_Sociedad v Alaves | event_Sociedad v Man Utd | event_Southampton v Chelsea | event_Sport Recife v Atletico MG | event_Sporting Lisbon v Portimonense | event_St Etienne v Reims | event_St Pauli v SV Darmstadt | event_Stakhovsky v Zapata Miralles | event_Stoke v Luton | event_Storm Sanders v Catherine Mcnally | event_Strasbourg v Angers | event_Sunrisers Hyderabad v Mumbai Indians | event_Sydney FC v Brisbane Roar | event_T Griekspoor v Carlos Alcaraz | event_Tabilo v Cerundolo | event_Titans v Knights | event_US Cremonese v Brescia | event_Universitatea Craiova v Hermannstadt | event_VVV Venlo v Az Alkmaar | event_Valencia v Celta Vigo | event_Valenciennes v ESTAC Troyes | event_Valladolid v Real Madrid | event_Venezia v Entella | event_VfL Osnabruck v FC Heidenheim | event_WSG Wattens v St Polten | event_Warriors v Dolphins | event_Watford v Derby | event_Wellington Phoenix v Western Sydney Wanderers | event_Werder Bremen v Schalke 04 | event_West Ham v Tottenham | event_Western United v Macarthur FC | event_Willem II v FC Utrecht | event_Wolves v Leeds | event_Wurzburger Kickers v Hamburger SV | event_Yellow-Red Mechelen v Gent | event_Zulte-Waregem v Standard | hour_0 | hour_1 | hour_2 | hour_3 | hour_4 | hour_5 | hour_6 | hour_7 | hour_8 | hour_9 | hour_10 | hour_11 | hour_12 | hour_13 | hour_14 | hour_15 | hour_16 | hour_17 | hour_18 | hour_19 | hour_20 | hour_21 | hour_22 | hour_23 | week_of_the_year_4 | week_of_the_year_5 | week_of_the_year_6 | week_of_the_year_7 | week_of_the_year_8 | week_of_the_year_45 | week_of_the_year_50 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 1 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| 2 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
X_train_categorical_encoded.shape
(7046, 281)
import joblib
joblib.dump(onehot_encoder, 'onehot_encoder.joblib')
['onehot_encoder.joblib']
from sklearn.feature_selection import SelectKBest,f_classif,chi2
n = SelectKBest(score_func=f_classif,k='all')
numcols = n.fit(X_train_numerical_std,y_train)
X_train_numerical_std.shape
(7046, 4)
top_fet={}
## https://machinelearningmastery.com/feature-selection-with-categorical-data/
for i in range(len(n.scores_)):
    top_fet[numerical_fet[i]] = n.scores_[i]
top_fet = sorted(top_fet.items(), key=lambda x: x[1], reverse=True)
top_fet = dict(top_fet)
for key, value in top_fet.items():
    print('{:<40}|{:>20}'.format(key, value))
marketId | 56.77709060822096
averagePriceMatched_boxcox | 53.33128448211988
betRate_boxcox | 50.96758568112809
stake_boxcox | 13.844882091182718
plt.figure(figsize=(8, 4))
sns.barplot(x=numcols.scores_, y=numerical_fet, color = 'b')
plt.title('Best Numerical Features')
plt.show()
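The scores plotted above come from `f_classif`, which computes a one-way ANOVA F-statistic per feature: how far apart the class means are relative to the spread within each class. The sketch below, on synthetic data, checks that interpretation against `scipy.stats.f_oneway`. The arrays `informative` and `noise` are invented for illustration and are unrelated to the betting data.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)

# Binary target with 50 rows per class (synthetic).
y = np.array([0] * 50 + [1] * 50)

# One feature whose mean shifts with the class, one that is pure noise.
informative = rng.normal(size=100) + 1.5 * y
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])

# F-statistic and p-value per column, as used by SelectKBest(score_func=f_classif).
F, p = f_classif(X, y)

# Same statistic computed column by column with SciPy's one-way ANOVA.
F_manual = np.array([
    f_oneway(X[y == 0, j], X[y == 1, j]).statistic
    for j in range(X.shape[1])
])

print(F)
print(F_manual)
```

A higher F-score indicates a larger difference in class means only; it says nothing about nonlinear relationships or about redundancy between features such as `betRate_boxcox` and `averagePriceMatched_boxcox`.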
c = SelectKBest(score_func=chi2)
numcols=c.fit(X_train_categorical_encoded,y_train)
top_fet={}
## https://machinelearningmastery.com/feature-selection-with-categorical-data/
for i in range(len(c.scores_)):
    top_fet[encodedCatColumnNames[i]] = c.scores_[i]
top_fet = sorted(top_fet.items(), key=lambda x: x[1],reverse=True)
top_fet = dict(top_fet)
# for key, value in top_fet.items():
#     print('{:<40}|{:>20}'.format(key, value))
aa = pd.DataFrame(zip(*sorted(zip(numcols.scores_, encodedCatColumnNames), reverse=True))).T
print('Top 10 Categorical Features')
aa.columns = ['score', 'name']
aa.head(10)
Top 10 Categorical Features
| | score | name |
|---|---|---|
| 0 | 2891.304 | week_of_the_year_6 |
| 1 | 1978.261 | weekday_3 |
| 2 | 1673.913 | event_Bangladesh v West Indies |
| 3 | 1369.565 | week_of_the_year_5 |
| 4 | 760.870 | week_of_the_year_50 |
| 5 | 760.870 | week_of_the_year_4 |
| 6 | 608.696 | event_Pakistan v South Africa |
| 7 | 456.522 | week_of_the_year_45 |
| 8 | 456.522 | event_Central Districts v Canterbury |
| 9 | 456.522 | event_Casanova v Dellien |
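To make these scores easier to interpret, the sketch below recomputes scikit-learn's `chi2` statistic by hand for two toy indicator columns. For each column it compares the observed per-class sum of the feature with the sum expected if the feature were independent of the class. The arrays `X` and `y` are invented for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Two one-hot style columns over 8 rows (hypothetical data).
X = np.array([[1, 0], [1, 0], [1, 0], [1, 0],
              [0, 1], [0, 1], [1, 0], [0, 1]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

scores, pvalues = chi2(X, y)

# Manual version: observed per-class feature sums vs. sums expected
# under independence (class frequency times total feature count).
classes = np.unique(y)
observed = np.array([X[y == c].sum(axis=0) for c in classes])
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))
manual = ((observed - expected) ** 2 / expected).sum(axis=0)

print(scores)  # column 1 occurs in class 0 only, so it scores higher
print(manual)
```

An indicator that fires almost exclusively in one class receives a large score, which may be why single weeks and single events rank at the top of the table above. Such features can reflect when the invalid bets were collected rather than a pattern that generalizes, so they are worth a second look before being trusted.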
plt.figure(figsize=(7,10))
sns.barplot(x=aa['score'].iloc[:30],y=aa['name'].iloc[:30], color='b')
plt.title('Top 30 Categorical Features')
plt.show()
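The merge step that follows uses `pd.concat(..., axis=1)`, which aligns rows by index label rather than by position. In this notebook both frames appear to carry a default `0..n-1` index, so the rows line up. The sketch below shows what could happen otherwise, using made-up frames `numeric` and `encoded`; the index values are borrowed from the `X_train_categorical.head()` output purely for illustration.

```python
import pandas as pd

# A frame that kept its original (shuffled) index from the train/test split.
numeric = pd.DataFrame({"stake": [0.1, 0.2, 0.3]}, index=[3134, 9132, 7384])

# A frame rebuilt from a NumPy array, so it has a default 0..2 index.
encoded = pd.DataFrame({"type_BACK": [1.0, 0.0, 0.0]})

# Index labels do not overlap: concat produces the union of labels, padded with NaN.
misaligned = pd.concat([numeric, encoded], axis=1)

# Resetting the index first makes the concat positional in effect.
aligned = pd.concat([numeric.reset_index(drop=True), encoded], axis=1)

print(misaligned.shape, aligned.shape)
```

Checking that the merged frame has the expected number of rows and no missing values is a cheap guard against this kind of silent misalignment.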
X_train_merged = pd.concat([X_train_numerical_std,X_train_categorical_encoded], axis=1)
X_test_merged = pd.concat([X_test_numerical_std,X_test_categorical_encoded], axis=1)
X_train_merged.head()
| stake_boxcox | betRate_boxcox | marketId | averagePriceMatched_boxcox | type_BACK | type_LAY | eventType_Cricket | eventType_Soccer | eventType_Tennis | marketName_Match Odds | marketName_Over/Under 0.5 Goals | marketName_Over/Under 1.5 Goals | marketName_Over/Under 2.5 Goals | marketName_Over/Under 3.5 Goals | marketName_Over/Under 4.5 Goals | marketName_Over/Under 5.5 Goals | marketName_Over/Under 6.5 Goals | marketName_Over/Under 7.5 Goals | marketName_Tied Match | event_AAB v Midtjylland | event_AC Horsens v OB | event_AC Milan v Inter | event_Accrington v Shrewsbury | event_Adelaide United v Central Coast Mariners | event_Admira Wacker v LASK Linz | event_Ajax v Sparta Rotterdam | event_Albacete v Sporting Gijon | event_Altmaier v Martin | event_Amiens v Sochaux | event_Arminia Bielefeld v Wolfsburg | event_Arsenal v Man City | event_Ascoli v Salernitana | event_Aston Villa v Leicester | event_Atalanta v Napoli | event_Athletic Bilbao v Villarreal | event_Atletico Madrid v Levante | event_Augsburg v Leverkusen | event_Austria Vienna v SCR Altach | event_Bangladesh v West Indies | event_Barbados v Jamaica | event_Barcelona v Cadiz | event_Barrere v Halys | event_Basaksehir v Trabzonspor | event_Bayern Munich v Arminia Bielefeld | event_Belenenses v CD Nacional Funchal | event_Benevento v Roma | event_Betis v Getafe | event_Boavista v Moreirense | event_Brady v Osaka | event_Braga v Tondela | event_Braunschweig v Jahn Regensburg | event_Brescia v US Cremonese | event_Brest v Lyon | event_Brondby v Vejle | event_Burnley v West Brom | event_Cagliari v Torino | event_Cape Cobras v Titans | event_Cardiff v Preston | event_Casanova v Dellien | event_Central Districts v Canterbury | event_Cerundolo v Se Baez | event_Cerundolo v Th Seyboth Wild | event_Chambly Oise v Auxerre | event_Christina Mchale v Maddison Inglis | event_Cittadella v Reggiana | event_Clezar v Olivo | event_Cori Gauff v Kaja Juvan | event_Coria v Tabilo | event_Corinthians v Vasco da Gama | 
event_Coventry v Brentford | event_Daniel v Ramanathan | event_Delhi Capitals v Royal Challengers Bangalore | event_Denizlispor v Genclerbirligi | event_Djokovic v Medvedev | event_Dolphins v Cape Cobras | event_Doncaster v Hull | event_Donskoy v S Kwon | event_Dzumhur v Jacquet | event_E King v Celikbilek | event_Eibar v Valladolid | event_Eintracht Frankfurt v Bayern Munich | event_Elche v Eibar | event_Emmen v PEC Zwolle | event_Erzgebirge v Bochum | event_Erzurum BB v Hatayspor | event_Escobedo v Meligeni Rodrigues Alve | event_Espanyol v Sabadell | event_Eupen v KV Oostende | event_FC Koln v Stuttgart | event_FC Twente v Feyenoord | event_FC Voluntari v Arges Pitesti | event_FCSB v Chindia Targoviste | event_FK Krasnodar v Sochi | event_Farense v Benfica | event_Fenerbahce v Goztepe | event_Fiorentina v Spezia | event_Flamengo v Internacional | event_Fortuna Sittard v ADO Den Haag | event_Freiburg v Union Berlin | event_Frosinone v Pescara | event_Fulham v Sheff Utd | event_Gaio v Donskoy | event_Galle Gladiators v Kandy Tuskers | event_Galloway v Vla Orlov | event_Gaziantep FK v Goztepe | event_Genk v KFCO Beerschot Wilrijk | event_Genoa v Verona | event_Gil Vicente v Santa Clara | event_Gillingham v Bristol Rovers | event_Girona v CD Castellon | event_Guyana v Trinidad & Tobago | event_Hapoel Beer Sheva v Nice | event_Heerenveen v FC Groningen | event_Hertha Berlin v RB Leipzig | event_Hoang v Van Assche | event_Hoffenheim v Werder Bremen | event_Huddersfield v Swansea | event_Huesca v Granada | event_Ilkel v Brooksby | event_Ipswich v Oxford Utd | event_Islamabad United v Multan Sultans | event_Jaziri v J Smith | event_Jua Varillas v Descotte | event_Jua Varillas v Manuel Cerundolo | event_Jung v Cressy | event_Karachi Kings v Quetta Gladiators | event_Karlsruhe v Nurnberg | event_Kasimpasa v Fatih Karagumruk Istanbul | event_Kerala Blasters FC v Jamshedpur FC | event_Knights v Dolphins | event_Kuzmanov v Miedler | event_Kwiatkowski v Bambridge | event_L 
Broady v Menendez-Maceiras | event_L Harris v Adrian Andreev | event_L Mayer v To Etcheverry | event_LR Vicenza Virtus v Spal | event_Lahore Qalandars v Peshawar Zalmi | event_Las Palmas v FC Cartagena | event_Lazio v Sampdoria | event_Le Havre v Dunkerque | event_Leeward Islands v Jamaica | event_Lions v Warriors | event_Liverpool v Everton | event_Lorient v Lille | event_Lugo v UD Logrones | event_Macarthur FC v Western Sydney Wanderers | event_Madison Brengle v Ellen Perez | event_Majchrzak v Lacko | event_Malaga v Rayo Vallecano | event_Malatyaspor v Konyaspor | event_Mallorca v Almeria | event_Malmo FF v Vasteras SK | event_Man Utd v Newcastle | event_Manuel Cerundolo v Tenti | event_Marchenko v Seppi | event_Medvedev v Tsitsipas | event_Menezes v Gu Justo | event_Mgladbach v Mainz | event_Miedler v Brooksby | event_Millwall v Wycombe | event_Misaki Doi v Liudmila Samsonova | event_Montpellier v Rennes | event_Mumbai City FC v Northeast United | event_Musetti v Gulbis | event_Nancy v Grenoble | event_Nantes v Marseille | event_Napoli v Juventus | event_New Zealand v Australia (1st T20) | event_Nice v Metz | event_Nimes v Bordeaux | event_Niort v Pau | event_Norwich v Rotherham | event_Nottm Forest v Blackburn | event_Olympiakos v Aris | event_Ornago v Bega | event_Otago v Northern Knights | event_PAS Giannina v OFI | event_PSV v Vitesse Arnhem | event_Pacos Ferreira v Guimaraes | event_Paderborn v SV Sandhausen | event_Pakistan v South Africa | event_Paris FC v Chateauroux | event_Paris St-G v Monaco | event_Parma v Udinese | event_Pisa v Empoli | event_Porto v Boavista | event_Portsmouth v Blackpool | event_Preston v Middlesbrough | event_QPR v Bournemouth | event_RKC Waalwijk v Heracles | event_Randers v FC Nordsjaelland | event_Red Bull Salzburg v Rapid Vienna | event_Reggina v Pordenone | event_Reims v Lens | event_Rio Ave v Famalicao | event_Ro Hobbs v J Smith | event_Rochdale v Plymouth | event_Rodez v Toulouse | event_Ross Co v Celtic | event_Royal 
Mouscron-Peruwelz v Cercle Brugge | event_S Kwon v Maden | event_Saisai Zheng v Danielle Rose Collins | event_Sassuolo v Bologna | event_Schalke 04 v Dortmund | event_Se Baez v Martin | event_Seppi v Musetti | event_Sevilla v Getafe | event_Sheff Wed v Birmingham | event_Shelby Rogers v Veronika Kudermetova | event_Sivasspor v Antalyaspor | event_Sivasspor v Kayserispor | event_Sociedad v Alaves | event_Sociedad v Man Utd | event_Southampton v Chelsea | event_Sport Recife v Atletico MG | event_Sporting Lisbon v Portimonense | event_St Etienne v Reims | event_St Pauli v SV Darmstadt | event_Stakhovsky v Zapata Miralles | event_Stoke v Luton | event_Storm Sanders v Catherine Mcnally | event_Strasbourg v Angers | event_Sunrisers Hyderabad v Mumbai Indians | event_Sydney FC v Brisbane Roar | event_T Griekspoor v Carlos Alcaraz | event_Tabilo v Cerundolo | event_Titans v Knights | event_US Cremonese v Brescia | event_Universitatea Craiova v Hermannstadt | event_VVV Venlo v Az Alkmaar | event_Valencia v Celta Vigo | event_Valenciennes v ESTAC Troyes | event_Valladolid v Real Madrid | event_Venezia v Entella | event_VfL Osnabruck v FC Heidenheim | event_WSG Wattens v St Polten | event_Warriors v Dolphins | event_Watford v Derby | event_Wellington Phoenix v Western Sydney Wanderers | event_Werder Bremen v Schalke 04 | event_West Ham v Tottenham | event_Western United v Macarthur FC | event_Willem II v FC Utrecht | event_Wolves v Leeds | event_Wurzburger Kickers v Hamburger SV | event_Yellow-Red Mechelen v Gent | event_Zulte-Waregem v Standard | hour_0 | hour_1 | hour_2 | hour_3 | hour_4 | hour_5 | hour_6 | hour_7 | hour_8 | hour_9 | hour_10 | hour_11 | hour_12 | hour_13 | hour_14 | hour_15 | hour_16 | hour_17 | hour_18 | hour_19 | hour_20 | hour_21 | hour_22 | hour_23 | week_of_the_year_4 | week_of_the_year_5 | week_of_the_year_6 | week_of_the_year_7 | week_of_the_year_8 | week_of_the_year_45 | week_of_the_year_50 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 
| weekday_5 | weekday_6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.401 | 0.313 | 0.778 | 0.386 | 1.000 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 1 | 0.122 | 1.089 | -0.105 | 1.087 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| 2 | -0.048 | -1.193 | 0.723 | -1.189 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
| 3 | -0.652 | 1.025 | 0.087 | 1.023 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 4 | -0.968 | -1.431 | 0.781 | -1.427 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
X_test_merged.head()
| stake_boxcox | betRate_boxcox | marketId | averagePriceMatched_boxcox | type_BACK | type_LAY | eventType_Cricket | eventType_Soccer | eventType_Tennis | marketName_Match Odds | marketName_Over/Under 0.5 Goals | marketName_Over/Under 1.5 Goals | marketName_Over/Under 2.5 Goals | marketName_Over/Under 3.5 Goals | marketName_Over/Under 4.5 Goals | marketName_Over/Under 5.5 Goals | marketName_Over/Under 6.5 Goals | marketName_Over/Under 7.5 Goals | marketName_Tied Match | event_AAB v Midtjylland | event_AC Horsens v OB | event_AC Milan v Inter | event_Accrington v Shrewsbury | event_Adelaide United v Central Coast Mariners | event_Admira Wacker v LASK Linz | event_Ajax v Sparta Rotterdam | event_Albacete v Sporting Gijon | event_Altmaier v Martin | event_Amiens v Sochaux | event_Arminia Bielefeld v Wolfsburg | event_Arsenal v Man City | event_Ascoli v Salernitana | event_Aston Villa v Leicester | event_Atalanta v Napoli | event_Athletic Bilbao v Villarreal | event_Atletico Madrid v Levante | event_Augsburg v Leverkusen | event_Austria Vienna v SCR Altach | event_Bangladesh v West Indies | event_Barbados v Jamaica | event_Barcelona v Cadiz | event_Barrere v Halys | event_Basaksehir v Trabzonspor | event_Bayern Munich v Arminia Bielefeld | event_Belenenses v CD Nacional Funchal | event_Benevento v Roma | event_Betis v Getafe | event_Boavista v Moreirense | event_Brady v Osaka | event_Braga v Tondela | event_Braunschweig v Jahn Regensburg | event_Brescia v US Cremonese | event_Brest v Lyon | event_Brondby v Vejle | event_Burnley v West Brom | event_Cagliari v Torino | event_Cape Cobras v Titans | event_Cardiff v Preston | event_Casanova v Dellien | event_Central Districts v Canterbury | event_Cerundolo v Se Baez | event_Cerundolo v Th Seyboth Wild | event_Chambly Oise v Auxerre | event_Christina Mchale v Maddison Inglis | event_Cittadella v Reggiana | event_Clezar v Olivo | event_Cori Gauff v Kaja Juvan | event_Coria v Tabilo | event_Corinthians v Vasco da Gama | 
event_Coventry v Brentford | event_Daniel v Ramanathan | event_Delhi Capitals v Royal Challengers Bangalore | event_Denizlispor v Genclerbirligi | event_Djokovic v Medvedev | event_Dolphins v Cape Cobras | event_Doncaster v Hull | event_Donskoy v S Kwon | event_Dzumhur v Jacquet | event_E King v Celikbilek | event_Eibar v Valladolid | event_Eintracht Frankfurt v Bayern Munich | event_Elche v Eibar | event_Emmen v PEC Zwolle | event_Erzgebirge v Bochum | event_Erzurum BB v Hatayspor | event_Escobedo v Meligeni Rodrigues Alve | event_Espanyol v Sabadell | event_Eupen v KV Oostende | event_FC Koln v Stuttgart | event_FC Twente v Feyenoord | event_FC Voluntari v Arges Pitesti | event_FCSB v Chindia Targoviste | event_FK Krasnodar v Sochi | event_Farense v Benfica | event_Fenerbahce v Goztepe | event_Fiorentina v Spezia | event_Flamengo v Internacional | event_Fortuna Sittard v ADO Den Haag | event_Freiburg v Union Berlin | event_Frosinone v Pescara | event_Fulham v Sheff Utd | event_Gaio v Donskoy | event_Galle Gladiators v Kandy Tuskers | event_Galloway v Vla Orlov | event_Gaziantep FK v Goztepe | event_Genk v KFCO Beerschot Wilrijk | event_Genoa v Verona | event_Gil Vicente v Santa Clara | event_Gillingham v Bristol Rovers | event_Girona v CD Castellon | event_Guyana v Trinidad & Tobago | event_Hapoel Beer Sheva v Nice | event_Heerenveen v FC Groningen | event_Hertha Berlin v RB Leipzig | event_Hoang v Van Assche | event_Hoffenheim v Werder Bremen | event_Huddersfield v Swansea | event_Huesca v Granada | event_Ilkel v Brooksby | event_Ipswich v Oxford Utd | event_Islamabad United v Multan Sultans | event_Jaziri v J Smith | event_Jua Varillas v Descotte | event_Jua Varillas v Manuel Cerundolo | event_Jung v Cressy | event_Karachi Kings v Quetta Gladiators | event_Karlsruhe v Nurnberg | event_Kasimpasa v Fatih Karagumruk Istanbul | event_Kerala Blasters FC v Jamshedpur FC | event_Knights v Dolphins | event_Kuzmanov v Miedler | event_Kwiatkowski v Bambridge | event_L 
Broady v Menendez-Maceiras | event_L Harris v Adrian Andreev | event_L Mayer v To Etcheverry | event_LR Vicenza Virtus v Spal | event_Lahore Qalandars v Peshawar Zalmi | event_Las Palmas v FC Cartagena | event_Lazio v Sampdoria | event_Le Havre v Dunkerque | event_Leeward Islands v Jamaica | event_Lions v Warriors | event_Liverpool v Everton | event_Lorient v Lille | event_Lugo v UD Logrones | event_Macarthur FC v Western Sydney Wanderers | event_Madison Brengle v Ellen Perez | event_Majchrzak v Lacko | event_Malaga v Rayo Vallecano | event_Malatyaspor v Konyaspor | event_Mallorca v Almeria | event_Malmo FF v Vasteras SK | event_Man Utd v Newcastle | event_Manuel Cerundolo v Tenti | event_Marchenko v Seppi | event_Medvedev v Tsitsipas | event_Menezes v Gu Justo | event_Mgladbach v Mainz | event_Miedler v Brooksby | event_Millwall v Wycombe | event_Misaki Doi v Liudmila Samsonova | event_Montpellier v Rennes | event_Mumbai City FC v Northeast United | event_Musetti v Gulbis | event_Nancy v Grenoble | event_Nantes v Marseille | event_Napoli v Juventus | event_New Zealand v Australia (1st T20) | event_Nice v Metz | event_Nimes v Bordeaux | event_Niort v Pau | event_Norwich v Rotherham | event_Nottm Forest v Blackburn | event_Olympiakos v Aris | event_Ornago v Bega | event_Otago v Northern Knights | event_PAS Giannina v OFI | event_PSV v Vitesse Arnhem | event_Pacos Ferreira v Guimaraes | event_Paderborn v SV Sandhausen | event_Pakistan v South Africa | event_Paris FC v Chateauroux | event_Paris St-G v Monaco | event_Parma v Udinese | event_Pisa v Empoli | event_Porto v Boavista | event_Portsmouth v Blackpool | event_Preston v Middlesbrough | event_QPR v Bournemouth | event_RKC Waalwijk v Heracles | event_Randers v FC Nordsjaelland | event_Red Bull Salzburg v Rapid Vienna | event_Reggina v Pordenone | event_Reims v Lens | event_Rio Ave v Famalicao | event_Ro Hobbs v J Smith | event_Rochdale v Plymouth | event_Rodez v Toulouse | event_Ross Co v Celtic | event_Royal 
Mouscron-Peruwelz v Cercle Brugge | event_S Kwon v Maden | event_Saisai Zheng v Danielle Rose Collins | event_Sassuolo v Bologna | event_Schalke 04 v Dortmund | event_Se Baez v Martin | event_Seppi v Musetti | event_Sevilla v Getafe | event_Sheff Wed v Birmingham | event_Shelby Rogers v Veronika Kudermetova | event_Sivasspor v Antalyaspor | event_Sivasspor v Kayserispor | event_Sociedad v Alaves | event_Sociedad v Man Utd | event_Southampton v Chelsea | event_Sport Recife v Atletico MG | event_Sporting Lisbon v Portimonense | event_St Etienne v Reims | event_St Pauli v SV Darmstadt | event_Stakhovsky v Zapata Miralles | event_Stoke v Luton | event_Storm Sanders v Catherine Mcnally | event_Strasbourg v Angers | event_Sunrisers Hyderabad v Mumbai Indians | event_Sydney FC v Brisbane Roar | event_T Griekspoor v Carlos Alcaraz | event_Tabilo v Cerundolo | event_Titans v Knights | event_US Cremonese v Brescia | event_Universitatea Craiova v Hermannstadt | event_VVV Venlo v Az Alkmaar | event_Valencia v Celta Vigo | event_Valenciennes v ESTAC Troyes | event_Valladolid v Real Madrid | event_Venezia v Entella | event_VfL Osnabruck v FC Heidenheim | event_WSG Wattens v St Polten | event_Warriors v Dolphins | event_Watford v Derby | event_Wellington Phoenix v Western Sydney Wanderers | event_Werder Bremen v Schalke 04 | event_West Ham v Tottenham | event_Western United v Macarthur FC | event_Willem II v FC Utrecht | event_Wolves v Leeds | event_Wurzburger Kickers v Hamburger SV | event_Yellow-Red Mechelen v Gent | event_Zulte-Waregem v Standard | hour_0 | hour_1 | hour_2 | hour_3 | hour_4 | hour_5 | hour_6 | hour_7 | hour_8 | hour_9 | hour_10 | hour_11 | hour_12 | hour_13 | hour_14 | hour_15 | hour_16 | hour_17 | hour_18 | hour_19 | hour_20 | hour_21 | hour_22 | hour_23 | week_of_the_year_4 | week_of_the_year_5 | week_of_the_year_6 | week_of_the_year_7 | week_of_the_year_8 | week_of_the_year_45 | week_of_the_year_50 | weekday_0 | weekday_1 | weekday_2 | weekday_3 | weekday_4 
| weekday_5 | weekday_6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.331 | 0.498 | 0.120 | 0.498 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 1 | -0.170 | -1.694 | 0.778 | -1.689 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 2 | -1.401 | -0.181 | 0.894 | -0.179 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 3 | 2.008 | -1.431 | -1.860 | -1.427 | 1.000 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| 4 | 0.331 | -1.431 | 0.781 | -1.427 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 
0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
print(X_train_merged.shape)
print(X_test_merged.shape)
(7046, 285)
(3020, 285)
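%%time
from sklearn.manifold import TSNE
# Project the training matrix to 2-D for visualization.
tsne2d = TSNE(
    n_components=2,
    perplexity=30.0,       # default
    init='random',         # alternative: 'pca'
    random_state=101,
    method='barnes_hut',
    n_iter=1000,           # renamed max_iter in recent scikit-learn releases
    verbose=2,
    angle=0.5
).fit_transform(X_train_merged)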
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 7046 samples in 0.580s...
[t-SNE] Computed neighbors for 7046 samples in 19.485s...
[t-SNE] Computed conditional probabilities for sample 1000 / 7046
[t-SNE] Computed conditional probabilities for sample 2000 / 7046
[t-SNE] Computed conditional probabilities for sample 3000 / 7046
[t-SNE] Computed conditional probabilities for sample 4000 / 7046
[t-SNE] Computed conditional probabilities for sample 5000 / 7046
[t-SNE] Computed conditional probabilities for sample 6000 / 7046
[t-SNE] Computed conditional probabilities for sample 7000 / 7046
[t-SNE] Computed conditional probabilities for sample 7046 / 7046
[t-SNE] Mean sigma: 0.810653
[t-SNE] Computed conditional probabilities in 0.952s
[t-SNE] Iteration 50: error = 89.7203979, gradient norm = 0.0320344 (50 iterations in 5.392s)
[t-SNE] Iteration 100: error = 74.5082397, gradient norm = 0.0058877 (50 iterations in 3.422s)
[t-SNE] Iteration 150: error = 71.4324188, gradient norm = 0.0031802 (50 iterations in 3.386s)
[t-SNE] Iteration 200: error = 70.0495529, gradient norm = 0.0021631 (50 iterations in 3.812s)
[t-SNE] Iteration 250: error = 69.2367172, gradient norm = 0.0017376 (50 iterations in 3.464s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 69.236717
[t-SNE] Iteration 300: error = 2.0232878, gradient norm = 0.0014130 (50 iterations in 3.400s)
[t-SNE] Iteration 350: error = 1.4106592, gradient norm = 0.0006168 (50 iterations in 3.309s)
[t-SNE] Iteration 400: error = 1.1314011, gradient norm = 0.0003657 (50 iterations in 3.594s)
[t-SNE] Iteration 450: error = 0.9787596, gradient norm = 0.0002395 (50 iterations in 3.529s)
[t-SNE] Iteration 500: error = 0.8841934, gradient norm = 0.0001754 (50 iterations in 3.442s)
[t-SNE] Iteration 550: error = 0.8209775, gradient norm = 0.0001379 (50 iterations in 3.417s)
[t-SNE] Iteration 600: error = 0.7766211, gradient norm = 0.0001165 (50 iterations in 3.354s)
[t-SNE] Iteration 650: error = 0.7450615, gradient norm = 0.0000984 (50 iterations in 3.325s)
[t-SNE] Iteration 700: error = 0.7225347, gradient norm = 0.0000911 (50 iterations in 3.792s)
[t-SNE] Iteration 750: error = 0.7064455, gradient norm = 0.0000833 (50 iterations in 3.721s)
[t-SNE] Iteration 800: error = 0.6953197, gradient norm = 0.0000790 (50 iterations in 3.729s)
[t-SNE] Iteration 850: error = 0.6873947, gradient norm = 0.0000738 (50 iterations in 3.626s)
[t-SNE] Iteration 900: error = 0.6811014, gradient norm = 0.0000684 (50 iterations in 3.454s)
[t-SNE] Iteration 950: error = 0.6760705, gradient norm = 0.0000677 (50 iterations in 3.354s)
[t-SNE] Iteration 1000: error = 0.6714608, gradient norm = 0.0000626 (50 iterations in 3.583s)
[t-SNE] KL divergence after 1000 iterations: 0.671461
Wall time: 1min 33s
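The first line of the log reports 91 nearest neighbors. With the Barnes-Hut method, scikit-learn derives this count from the perplexity as three times the perplexity plus one, capped at one less than the number of samples. A quick check of that arithmetic (the variable names are chosen here for illustration only):

```python
# Neighbor count used by Barnes-Hut t-SNE for the run above.
perplexity = 30.0
n_samples = 7046  # rows in X_train_merged

n_neighbors = min(n_samples - 1, int(3.0 * perplexity + 1))
print(n_neighbors)
```

Raising the perplexity therefore widens each point's neighborhood and increases the cost of the neighbor search.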
y_ = y_train.values
df = pd.DataFrame({'x': tsne2d[:, 0], 'y': tsne2d[:, 1], 'label': y_})
# Scatter plot of the 2-D embedding, colored by class label (no regression line).
sns.lmplot(data=df, x='x', y='y', hue='label', fit_reg=False, height=6, palette="Set2")
plt.title("perplexity : {} and max_iter : {}".format(30, 1000))
plt.show()
# This function plots the confusion, precision, and recall matrices side by side.
def plot_confusion_matrix(test_y, predict_y):
    """
    Plot confusion, precision, and recall heatmaps.
    test_y : ground truth y values.
    predict_y : predicted y values.
    """
    C = confusion_matrix(test_y, predict_y)
    # Recall matrix: divide each row by its sum (rows are the true classes).
    A = (C.T / C.sum(axis=1)).T
    # Precision matrix: divide each column by its sum (columns are the predicted classes).
    B = C / C.sum(axis=0)
    plt.figure(figsize=(16, 4))
    labels = [0, 1]
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, fmt=".3f", xticklabels=labels, yticklabels=labels, cmap="Blues")
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")
    plt.subplot(1, 3, 2)
    sns.heatmap(B, annot=True, fmt=".3f", xticklabels=labels, yticklabels=labels, cmap="Reds")
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")
    plt.subplot(1, 3, 3)
    sns.heatmap(A, annot=True, fmt=".3f", xticklabels=labels, yticklabels=labels, cmap="Greens")
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")
    plt.show()
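To make the row and column normalization used in plot_confusion_matrix concrete, here is a minimal pure-Python sketch. The 2x2 counts are hypothetical and chosen only for illustration; they are not taken from this dataset, and the helper names row_normalize and column_normalize are not part of the notebook.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class.
C = [[8, 2],
     [1, 9]]

def row_normalize(matrix):
    """Divide each row by its sum: entry [i][i] is the recall of class i."""
    return [[value / sum(row) for value in row] for row in matrix]

def column_normalize(matrix):
    """Divide each column by its sum: entry [j][j] is the precision of class j."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [[value / col_sums[j] for j, value in enumerate(row)] for row in matrix]

recall_matrix = row_normalize(C)
precision_matrix = column_normalize(C)
print("recall matrix   :", recall_matrix)
print("precision matrix:", precision_matrix)
```

Each row of the recall matrix sums to 1 and each column of the precision matrix sums to 1, which is a quick sanity check when reading the heatmaps.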
from sklearn.model_selection import GridSearchCV, train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, normalize
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
# Metrics are imported from sklearn.metrics; the sklearn.metrics.classification
# module is deprecated since scikit-learn 0.22 and scheduled for removal in 0.24.
from sklearn.metrics import (accuracy_score, log_loss, f1_score, make_scorer,
                             precision_score, recall_score, roc_auc_score,
                             confusion_matrix, roc_curve, auc)
test_len = len(y_test)
# Random baseline: for each test row, draw two random numbers and normalize them into class probabilities.
predicted_y = np.zeros((test_len, 2))
for i in range(test_len):
    rand_probs = np.random.rand(1, 2)
    predicted_y[i] = (rand_probs / rand_probs.sum())[0]
print("Log loss on Test Data using Random Model : ", log_loss(y_test, predicted_y))
print("Accuracy on Test Data using Random Model : ", accuracy_score(y_test, predicted_y[:, 1].round()))
predicted_y = np.argmax(predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y)
Log loss on Test Data using Random Model : 0.888832408526019
Accuracy on Test Data using Random Model : 0.5072847682119205
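The log loss of about 0.89 is roughly what this normalization scheme gives in expectation. For two independent uniform draws u and v, the expected value of -ln(u / (u + v)) works out to 2 ln 2 - 1/2 ≈ 0.886, and by symmetry the same holds for either class. The sketch below checks that figure with a small standard-library simulation; the sample size and seed are arbitrary choices, and random_model_log_loss is a name introduced here for illustration.

```python
import math
import random

random.seed(0)  # arbitrary seed, used only to make the run repeatable

def random_model_log_loss(n_samples):
    """Average -log(probability assigned to the true class) for the random model."""
    total = 0.0
    for _ in range(n_samples):
        # 1 - random() lies in (0, 1], which avoids taking log(0).
        u = 1.0 - random.random()
        v = 1.0 - random.random()
        p_true = u / (u + v)  # normalized probability of the true class
        total += -math.log(p_true)
    return total / n_samples

simulated = random_model_log_loss(200_000)
closed_form = 2 * math.log(2) - 0.5
print("simulated  :", round(simulated, 3))
print("closed form:", round(closed_form, 3))
```

Any model worth keeping should land well below this value on the test set.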
alpha = [10 ** x for x in range(-5, 5)]  # hyperparameter grid for the SGD classifier
log_error_array = []
# Note: each alpha is scored on the test split; a separate validation split would give a less optimistic estimate.
for i in alpha:
    # loss='log' selects logistic regression (it is named 'log_loss' in scikit-learn >= 1.1).
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42, class_weight="balanced")
    clf.fit(X_train_merged, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_train_merged, y_train)
    predict_y = sig_clf.predict_proba(X_test_merged)
    loss = log_loss(y_test, predict_y, labels=clf.classes_)
    log_error_array.append(loss)
    print('For values of alpha = ', i, "The log loss is:", loss)
For values of alpha = 1e-05 The log loss is: 0.0038003967310407173
For values of alpha = 0.0001 The log loss is: 0.003940539977052907
For values of alpha = 0.001 The log loss is: 0.004465209138230226
For values of alpha = 0.01 The log loss is: 0.005382347501666373
For values of alpha = 0.1 The log loss is: 0.007907343496749052
For values of alpha = 1 The log loss is: 0.015124555851035138
For values of alpha = 10 The log loss is: 0.018434823601526217
For values of alpha = 100 The log loss is: 0.018885417145215463
For values of alpha = 1000 The log loss is: 0.01884200701542584
For values of alpha = 10000 The log loss is: 0.018785643196478338
# Set the figure size when the figure is created so that it applies to this plot.
fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
# The log loss plotted here is measured on the test split.
plt.title("Test log loss for each alpha")
plt.xlabel("Alpha")
plt.ylabel("Log loss")
plt.show()
best_alpha = np.argmin(log_error_array)
alpha[best_alpha]
1e-05
best_alpha = np.argmin(log_error_array)
# Refit with the best alpha. Note that class_weight is not set here, unlike in the tuning loop.
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(X_train_merged, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_train_merged, y_train)
predict_y = sig_clf.predict_proba(X_train_merged)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:", log_loss(y_train, predict_y, labels=clf.classes_))
predict_y = sig_clf.predict_proba(X_test_merged)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:", log_loss(y_test, predict_y, labels=clf.classes_))
predicted_y = np.argmax(predict_y, axis=1)
print("Accuracy :", accuracy_score(y_test, predicted_y)*100)
plot_confusion_matrix(y_test, predicted_y)
For values of best alpha = 1e-05 The train log loss is: 0.0028172964849306208
For values of best alpha = 1e-05 The test log loss is: 0.0034779763217249785
Accuracy : 99.96688741721854
import pickle
accuracy = {}
roc_r = {}
def train_model(model, name):
    """Fit the model, print test metrics, plot the confusion matrix and ROC curve, and pickle the model."""
    model = model.fit(X_train_merged, y_train)
    pred = model.predict(X_test_merged)
    acc = accuracy_score(y_test, pred)*100
    accuracy[name] = acc  # key by model name rather than by the estimator object
    print('accuracy_score', acc)
    print('precision_score', precision_score(y_test, pred)*100)
    print('recall_score', recall_score(y_test, pred)*100)
    print('f1_score', f1_score(y_test, pred)*100)
    # ROC AUC is computed from hard class predictions here, not from predicted probabilities.
    roc_score = roc_auc_score(y_test, pred)*100
    roc_r[name] = roc_score
    print('roc_auc_score', roc_score)
    # confusion matrix
    print('confusion_matrix')
    plot_confusion_matrix(y_test, pred)
    fpr, tpr, threshold = roc_curve(y_test, pred)
    roc_auc = auc(fpr, tpr)*100
    plt.figure(figsize=(4, 4))
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    # save the classifier
    with open(str(name)+'.pkl', 'wb') as fid:
        pickle.dump(model, fid)
    # # load it again
    # with open('my_dumped_classifier.pkl', 'rb') as fid:
    #     gnb_loaded = pickle.load(fid)
lr = LogisticRegression(C=5.0,class_weight="balanced", max_iter= 10000)
train_model(lr, 'log_reg_sklern')
accuracy_score 99.96688741721854
precision_score 100.0
recall_score 95.0
f1_score 97.43589743589743
roc_auc_score 97.5
confusion_matrix
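Because train_model passes hard class predictions to roc_auc_score, the ROC curve has a single interior point and the area reduces to (TPR + 1 - FPR) / 2. A precision of 100% means there are no false positives, so FPR is 0, and a recall of 95% gives TPR = 0.95. The sketch below reproduces the reported 97.5 under that reading; auc_from_hard_predictions is a hypothetical helper name, not a scikit-learn function.

```python
def auc_from_hard_predictions(tpr, fpr):
    """Area under the ROC polyline (0, 0) -> (fpr, tpr) -> (1, 1)."""
    # Trapezoid rule over the two straight segments of the curve.
    left = 0.5 * fpr * tpr
    right = 0.5 * (1.0 - fpr) * (tpr + 1.0)
    return left + right

# TPR = recall = 0.95, FPR = 0 because precision is 100%.
auc_percent = auc_from_hard_predictions(tpr=0.95, fpr=0.0) * 100
print(round(auc_percent, 2))
```

A score computed from predict_proba outputs would use the full ranking of test rows and can differ from this value.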
knn = KNeighborsClassifier(weights='distance', algorithm='auto', n_neighbors=15)
train_model(knn, 'knn')
accuracy_score 99.76821192052981
precision_score 100.0
recall_score 65.0
f1_score 78.7878787878788
roc_auc_score 82.5
confusion_matrix
KNN
dtc = DecisionTreeClassifier(class_weight="balanced")
train_model(dtc, 'decision_tree')
accuracy_score 99.96688741721854
precision_score 100.0
recall_score 95.0
f1_score 97.43589743589743
roc_auc_score 97.5
confusion_matrix
Decision Tree
rfc = RandomForestClassifier(n_estimators=100,criterion='gini',class_weight="balanced")
train_model(rfc, 'random_forest')
accuracy_score 99.90066225165563
precision_score 100.0
recall_score 85.0
f1_score 91.89189189189189
roc_auc_score 92.5
confusion_matrix
Random Forest
scale_pos_weight = np.floor(y_train.value_counts()[0]/y_train.value_counts()[1])
xgb = XGBClassifier(scale_pos_weight = scale_pos_weight)
train_model(xgb, 'xgboost')
accuracy_score 99.96688741721854
precision_score 100.0
recall_score 95.0
f1_score 97.43589743589743
roc_auc_score 97.5
confusion_matrix
XGBOOST
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
from keras_tqdm import TQDMNotebookCallback
Using TensorFlow backend.
n_samples = len(y_train)
n_classes = 2
weight = n_samples / (n_classes * np.bincount(y_train))
inputDim = X_train_merged.shape[1]
class_weight = {0: np.ceil(weight[0]),
1: np.ceil(weight[1])}
class_weight
{0: 1.0, 1: 77.0}
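The weights above follow the "balanced" heuristic n_samples / (n_classes * count_per_class), rounded up. The sketch below replays that arithmetic in plain Python. The class counts (7000 negative, 46 positive) are an assumption: they are inferred from the printed weights and the 7046 training rows, not shown in the notebook output.

```python
import math

# Assumed training class counts, inferred from the printed weights (not shown in the notebook).
class_counts = {0: 7000, 1: 46}

n_samples = sum(class_counts.values())  # 7046 training rows
n_classes = len(class_counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
raw_weights = {label: n_samples / (n_classes * count)
               for label, count in class_counts.items()}
class_weight = {label: float(math.ceil(w)) for label, w in raw_weights.items()}

print(raw_weights)
print(class_weight)
```

Rounding up with ceil pushes the majority-class weight from roughly 0.5 to 1.0, so the effective ratio between the two classes is 77:1 rather than about 152:1.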
epoch = 40
batch = 1000
# define the keras model
model = Sequential()
model.add(Dense(64, input_dim=inputDim, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
WARNING:tensorflow:From C:\Users\Vijay\anaconda3\envs\py36\lib\site-packages\tensorflow\python\ops\nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 64)                18304
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080
_________________________________________________________________
dense_3 (Dense)              (None, 8)                 264
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 9
=================================================================
Total params: 20,657
Trainable params: 20,657
Non-trainable params: 0
_________________________________________________________________
%%time
# fit the keras model on the dataset
history = model.fit(X_train_merged, y_train, epochs=epoch, batch_size=batch, class_weight=class_weight, verbose=0,
validation_data = (X_test_merged,y_test),)
WARNING:tensorflow:From C:\Users\Vijay\anaconda3\envs\py36\lib\site-packages\keras\backend\tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
Wall time: 7.13 s
# evaluate the keras model on the training data
# (a separate name keeps the `accuracy` dict used by train_model from being overwritten)
_, train_accuracy = model.evaluate(X_train_merged, y_train)
print('Accuracy: %.2f' % (train_accuracy*100))
7046/7046 [==============================] - 0s 36us/step
Accuracy: 100.00
# evaluate the keras model on the test data
_, test_accuracy = model.evaluate(X_test_merged, y_test)
print('Accuracy: %.2f' % (test_accuracy*100))
3020/3020 [==============================] - 0s 70us/step
Accuracy: 99.97
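# Threshold the sigmoid output at 0.5 to get class labels.
# This matches predict_classes, which is not available in recent Keras/TensorFlow releases.
predictions = (model.predict(X_test_merged) > 0.5).astype("int32").ravel()
y_classes = predictions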
confusion_matrix(y_test.values, y_classes)
array([[3000, 0],
[ 1, 19]], dtype=int64)
plot_confusion_matrix(y_test.values, y_classes)
score = roc_auc_score(y_test.values, y_classes)
print("ROC Score : ", score)
ROC Score : 0.975
print('precision_score',precision_score(y_test.values, y_classes)*100)
print('recall_score',recall_score(y_test.values, y_classes)*100)
print('f1_score',f1_score(y_test.values, y_classes)*100)
precision_score 100.0
recall_score 95.0
f1_score 97.43589743589743
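These scores can be checked by hand from the confusion matrix printed above ([[3000, 0], [1, 19]]). The sketch below is a plain-Python recomputation using the standard definitions; it does not call scikit-learn.

```python
# Counts read from the confusion matrix: rows = true class, columns = predicted class.
tn, fp = 3000, 0
fn, tp = 1, 19

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
acc = (tp + tn) / (tp + tn + fp + fn)

print('precision', precision * 100)
print('recall', recall * 100)
print('f1', f1 * 100)
print('accuracy', acc * 100)
```

The single missed positive costs 5 points of recall but only about 0.03 points of accuracy, which is why recall and F1 separate the models more clearly than accuracy does.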
def loss_acc_plot(history=history):
    plt.figure(figsize=(16, 5))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model Accuracy Plot')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='lower right')
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model Loss Plot')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='upper right')
    plt.show()
loss_acc_plot(history)
model.save('ann_model')
from prettytable import PrettyTable
# Specify the Column Names while initializing the Table
myTable = PrettyTable(["Classifier", "Accuracy", "AUC"])
# Add rows
myTable.add_row(["Random Model", "≈50", "-"])
myTable.add_row(["Logistic Regression->SGD", "99.96", "-"])
myTable.add_row(["Logistic Regression->SKLearn", "99.96", "97.5"])
myTable.add_row(["KNN", "99.76", "82.5"])
myTable.add_row(["Decision Tree", "99.96", "97.5"])
myTable.add_row(["Random Forest", "99.90", "92.5"])
myTable.add_row(["XGBoost", "99.96", "97.5"])
myTable.add_row(["Neural Network (ANN)", "99.96", "97.5"])
print(myTable)
+------------------------------+----------+------+
| Classifier                   | Accuracy | AUC  |
+------------------------------+----------+------+
| Random Model                 | ≈50      | -    |
| Logistic Regression->SGD     | 99.96    | -    |
| Logistic Regression->SKLearn | 99.96    | 97.5 |
| KNN                          | 99.76    | 82.5 |
| Decision Tree                | 99.96    | 97.5 |
| Random Forest                | 99.90    | 92.5 |
| XGBoost                      | 99.96    | 97.5 |
| Neural Network (ANN)         | 99.96    | 97.5 |
+------------------------------+----------+------+
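The data is highly imbalanced: the test set contains 3,000 negative samples and only 20 positive samples.
All models score very high on accuracy largely because of this imbalance, so precision, recall, and F1 are more informative than accuracy here.
Logistic Regression (SGD and sklearn), Decision Tree, XGBoost, and the Neural Network (ANN) perform equally and best (99.96% accuracy); KNN and Random Forest have lower recall.
Because the dataset is small, the models appear to be overfitted, so more data is needed to build a more general predictive model.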
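To put the accuracy figures in context, the sketch below computes what a trivial classifier that always predicts the negative class would score on the test split. It uses the class counts from the confusion matrix reported earlier (3000 negative, 20 positive); the always-negative baseline itself is an illustration and was not run in this notebook.

```python
# Test-set class counts, read from the confusion matrix reported earlier.
n_negative, n_positive = 3000, 20

# A classifier that always predicts 0 gets every negative right and every positive wrong.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall = 0 / n_positive
baseline_balanced_accuracy = (1.0 + baseline_recall) / 2  # mean of TNR and TPR

print('baseline accuracy          : %.2f' % (baseline_accuracy * 100))
print('baseline recall            : %.2f' % (baseline_recall * 100))
print('baseline balanced accuracy : %.2f' % (baseline_balanced_accuracy * 100))
```

An accuracy near 99.3% is therefore the floor, not a sign of a good model, and the gap between 99.76% and 99.96% in the comparison table is better read through recall.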